Troubleshooting¶
Common problems and solutions for local LLM deployment.
Model Loading Issues¶
Model File Not Found¶
Causes:
- Incorrect path
- Model not downloaded
- Wrong mount in Docker
Solutions:
# Verify file exists
ls -la /path/to/model.gguf
# For Docker, check mount
docker exec ollama ls -la /root/.ollama/models
# Ollama: re-pull model
ollama pull llama3.3:70b
Invalid Model Format¶
Causes:
- Corrupted download
- Wrong format for the engine
- Incompatible quantization
Solutions:
# Re-download
ollama rm llama3.3:70b
ollama pull llama3.3:70b
# For GGUF, verify file
file model.gguf
# Should show: GGUF model data
Model Too Large¶
Causes:
- Insufficient GPU memory
- Model exceeds available RAM
Solutions:
# Use smaller quantization
ollama pull llama3.3:70b-instruct-q4_K_S # Smaller than Q4_K_M
# Reduce GPU layers
./llama-server -m model.gguf -ngl 30 # Partial offload
# Reduce context length
./llama-server -m model.gguf -c 4096
GPU Issues¶
GPU Not Detected¶
NVIDIA Solutions:
# Check driver
nvidia-smi
# Check CUDA
nvcc --version
# Docker: Verify nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Reinstall toolkit and register the runtime with Docker
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
AMD Solutions:
# Check ROCm
rocminfo
rocm-smi
# Docker: Check device access
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/rocm-terminal rocminfo
GPU Memory Exhausted¶
Solutions:
# Check current usage
nvidia-smi
# Unload unused models
ollama stop other-model
# Use smaller model/quantization
ollama run llama3.3:70b-instruct-q3_K_M # Instead of Q4_K_M
# Reduce context (num_ctx is an API/Modelfile option, not a CLI flag)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "options": {"num_ctx": 4096}}'
Slow GPU Performance¶
Causes:
- Not all layers on GPU
- Thermal throttling
- Power management
Solutions:
# Verify all layers were offloaded: llama.cpp's startup log
# prints e.g. "offloaded 99/99 layers to GPU"
# Check temperature
nvidia-smi -q -d TEMPERATURE
# Check power state
nvidia-smi -q -d PERFORMANCE
# Force full GPU offload
./llama-server -m model.gguf -ngl 99
API Issues¶
Connection Refused¶
Solutions:
# Check service is running
systemctl status ollama
docker ps | grep ollama
# Check binding
ss -tuln | grep 11434
# Check firewall
sudo ufw status
# Bind to all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
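Setting OLLAMA_HOST on the command line does not persist for a systemd-managed install; a sketch using a systemd drop-in, assuming the default ollama unit name:
# Persist the binding via a systemd override
sudo systemctl edit ollama
# In the editor, add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama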
Timeout on Requests¶
Causes:
- Model loading slowly
- Large prompt
- Slow hardware
Solutions:
# Increase timeout in client
curl --max-time 300 http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Hello"}'
# Pre-load model
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.3:70b", "keep_alive": "1h"}'
# Check model is loaded
ollama ps
Empty or Truncated Response¶
Causes:
- max_tokens too low
- Stop token triggered
- Context overflow
Solutions:
# Increase max_tokens (OpenAI-compatible endpoint)
curl ... -d '{"max_tokens": 2000}'
# Check stop tokens
# Remove or adjust stop sequences
# Check context length
# Reduce prompt size or increase context
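For Ollama's native /api/generate endpoint the equivalent option is num_predict; a sketch, with model name and values illustrative:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "...", "options": {"num_predict": 2000, "num_ctx": 8192}}'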
Docker Issues¶
Container Won't Start¶
# Check logs
docker logs ollama
# Check image
docker images | grep ollama
# Pull fresh image
docker pull ollama/ollama:latest
GPU Not Available in Container¶
# NVIDIA: Check runtime
docker info | grep -i runtime
# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# AMD: Check devices
ls -la /dev/kfd /dev/dri
# Verify permissions
groups # Should include video, render
Volume Mount Issues¶
# Check mount
docker exec ollama ls -la /root/.ollama
# Fix permissions
sudo chown -R 1000:1000 /tank/ai/models/ollama
# SELinux (if applicable)
chcon -Rt svirt_sandbox_file_t /tank/ai/models/ollama
Performance Issues¶
Slow Token Generation¶
Causes:
- CPU fallback
- Memory thrashing
- Suboptimal quantization
Solutions:
# Verify GPU usage
nvidia-smi -l 1
# Check for swap usage
free -h
swapon --show # Should be minimal
# Use appropriate quantization
# Q4_K_M is a good balance; avoid very low quants like Q2_K
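For a concrete tokens-per-second number, ollama run accepts a --verbose flag that prints timing stats after the response (model name illustrative):
ollama run --verbose llama3.3:70b "Why is the sky blue?"
# Check the "eval rate" line in the printed stats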
High Latency (TTFT)¶
Causes:
- Large context
- Model loading
- Cold start
Solutions:
# Keep model loaded
OLLAMA_KEEP_ALIVE=1h
# Pre-load on start
curl -X POST http://localhost:11434/api/generate \
-d '{"model": "llama3.3:70b", "keep_alive": "24h"}'
# Reduce context (llama.cpp)
./llama-server -m model.gguf -c 4096
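To confirm that cold start is the cause, time the same request twice; a large gap between the first and second run points at model load time (model name illustrative):
# First run includes model load; second should be much faster
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Hi", "stream": false}' > /dev/null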
Memory Pressure¶
Solutions:
# Monitor memory
watch -n 1 free -h
# Reduce simultaneous models
OLLAMA_MAX_LOADED_MODELS=1
# Use smaller quantization, e.g.
ollama run llama3.3:70b-instruct-q3_K_M
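For Docker deployments, set the variable on the container; a sketch reusing the volume and port layout from the reset example later in this guide:
docker run -d --gpus all \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -v /tank/ai/models/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama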
Coding Tool Issues¶
Tool Can't Connect¶
Solutions:
# Set environment
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed
# Verify API responds
curl $OPENAI_API_BASE/models
# Check tool-specific config
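A quick end-to-end test of the OpenAI-compatible endpoint itself (model name illustrative):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-coder-v2:16b", "messages": [{"role": "user", "content": "Hello"}]}'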
Wrong Model Used¶
Solutions:
# Specify model explicitly
aider --model ollama/deepseek-coder-v2:16b
# Check available models
ollama list
# In config files, use exact model name
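For example, aider reads .aider.conf.yml from the home or repo directory; a sketch pinning the model there (model name illustrative):
cat > ~/.aider.conf.yml <<'EOF'
model: ollama/deepseek-coder-v2:16b
EOF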
Slow Code Completion¶
Solutions:
- Use a faster model (7-8B) for completion; see the sketch below
- Use a larger model for complex tasks
- Reduce the context sent with requests
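A minimal sketch of that split with aider, assuming both models are already pulled (model names illustrative):
# Small, fast model for routine edits and completion
aider --model ollama/qwen2.5-coder:7b
# Larger model for complex refactors
aider --model ollama/deepseek-coder-v2:16b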
Log Analysis¶
Ollama Logs¶
# View logs
journalctl -u ollama -f
# Docker logs
docker logs -f ollama
# Debug mode
OLLAMA_DEBUG=1 ollama serve
llama.cpp Logs¶
# Enable verbose
./llama-server -m model.gguf --verbose
# Log to file
./llama-server -m model.gguf --log-file server.log
System Logs¶
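Kernel logs often hold the root cause for OOM kills and GPU faults; typical checks:
# OOM killer and GPU errors (e.g. NVIDIA Xid messages)
sudo dmesg | grep -iE 'oom|xid|nvrm|amdgpu'
# Follow the kernel log live
journalctl -k -f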
Recovery Steps¶
Full Reset (Ollama)¶
# Stop service
sudo systemctl stop ollama
# Clear models (optional)
rm -rf ~/.ollama/models
# Restart
sudo systemctl start ollama
# Re-pull models
ollama pull llama3.3:70b
Full Reset (Docker)¶
# Stop and remove
docker stop ollama
docker rm ollama
# Optional: clear data
sudo rm -rf /tank/ai/models/ollama/*
# Start fresh
docker run -d --gpus all \
-v /tank/ai/models/ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
GPU Driver Reset¶
# NVIDIA (requires the GPU to be idle)
sudo nvidia-smi --gpu-reset
# If that fails
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
# Reboot if necessary
sudo reboot
Getting Help¶
Collect Diagnostic Info¶
# System info
uname -a
cat /etc/os-release
# GPU info
nvidia-smi # or rocm-smi
# Memory
free -h
# Docker
docker version
docker info | grep -i nvidia
# Ollama
ollama --version
ollama list
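To bundle all of this into a single file for a bug report, a convenience sketch:
{
  uname -a
  cat /etc/os-release
  nvidia-smi || rocm-smi
  free -h
  docker version
  ollama --version
  ollama list
} > diagnostics.txt 2>&1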
Where to Ask¶
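- Ollama: issues and discussions at github.com/ollama/ollama
- llama.cpp: discussions at github.com/ggml-org/llama.cpp
- Community forums such as r/LocalLLaMA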
See Also¶
- Performance Index - Optimization
- Memory Management - Memory issues
- GPU Containers - Container GPU setup