Troubleshooting
Common problems and solutions for local LLM deployment on the MS-S1 MAX (AMD ROCm) and Apple Silicon laptops (Metal). NVIDIA / CUDA is not used in this build; if you see a CUDA-specific error here, treat it as generic "GPU memory" guidance.
Model loading issues
Model file not found
Causes:
- Incorrect path
- Model not downloaded
- Wrong mount in Docker
Solutions:
# Verify file exists
ls -la /path/to/model.gguf
# For Docker, check mount
docker exec ollama ls -la /root/.ollama/models
# Ollama: re-pull model
ollama pull llama3.3:70b
Invalid model format
Causes:
- Corrupted download
- Wrong format for engine
- Incompatible quantization
Solutions:
# Re-download
ollama rm llama3.3:70b
ollama pull llama3.3:70b
# For GGUF, verify file
file model.gguf
# Should show: GGUF model data
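If the installed `file` does not recognize GGUF (older magic databases may report only `data`), the header can be checked directly. The sketch below assumes only that a valid GGUF file starts with the four ASCII bytes `GGUF`; `is_gguf` is a hypothetical helper name, not part of llama.cpp or Ollama.

```shell
# is_gguf FILE: succeed if FILE begins with the 4-byte magic "GGUF".
# Returns 2 if the file cannot be read, 1 if the magic does not match.
# (Hypothetical helper; the name is illustrative.)
is_gguf() {
  [ -r "$1" ] || return 2
  # Dump the first 4 bytes as hex: "GGUF" is 47 47 55 46.
  magic=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
  [ "$magic" = "47475546" ]
}
```

For example, `is_gguf model.gguf || echo "not a GGUF file, re-download"` flags a truncated download or an HTML error page saved under a `.gguf` name.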
Model too large
Causes:
- Insufficient GPU memory (UMA frame buffer too small)
- Model exceeds available RAM
- BIOS UMA setting needs tuning (see Memory Configuration)
Solutions:
# Use smaller quantization
ollama pull llama3.3:70b-instruct-q4_K_S # Smaller than Q4_K_M
# Reduce GPU layers
./llama-server -m model.gguf -ngl 30 # Partial offload
# Reduce context length
./llama-server -m model.gguf -c 4096
GPU issues
GPU not detected (AMD ROCm, MS-S1 MAX)
# Host: confirm ROCm sees the iGPU
rocminfo | head
rocm-smi
ls -l /dev/kfd /dev/dri
# Docker: confirm device passthrough
docker run --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
rocm/rocm-terminal rocminfo | head
# If the container can't see the iGPU, the Compose file is almost
# certainly missing `devices:` or `group_add:`.
If rocminfo returns "No agents found", ROCm itself isn't installed correctly — see ROCm Installation.
GPU not detected (Apple Silicon, laptop)
- Make sure you're not running inside Docker Desktop: it cannot expose Metal to containers, so MLX / Metal-backed builds must run natively.
- For PyTorch, verify with torch.backends.mps.is_available().
- For llama.cpp, Metal is enabled by default in current macOS builds; older Makefile-based builds needed LLAMA_METAL=1.
GPU memory exhausted
Solutions:
# AMD: check current usage
rocm-smi --showmeminfo vram
# Unload unused models
ollama stop other-model
# Use a smaller model/quantization
ollama run llama3.3:70b-instruct-q3_K_M # Instead of Q4_K_M
# Reduce context (inside an `ollama run` session; there is no --num-ctx CLI flag)
/set parameter num_ctx 4096
On the MS-S1 MAX, "GPU memory" is a slice of the unified-memory pool configured via the BIOS UMA frame buffer. If you keep running out, you may need to raise that allocation — see Memory Configuration.
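When calling Ollama over HTTP, the context window is a per-request option. As a minimal sketch, the helper below builds an `/api/generate` body that sets `num_ctx` under `options`; `generate_payload` and the `ping` prompt are hypothetical, and the helper assumes the model name needs no JSON escaping.

```shell
# generate_payload MODEL NUM_CTX: print a JSON body for Ollama's /api/generate
# that caps the context window for this request.
# (Hypothetical helper; assumes MODEL contains no quotes or backslashes.)
generate_payload() {
  printf '{"model": "%s", "prompt": "ping", "stream": false, "options": {"num_ctx": %d}}\n' "$1" "$2"
}

# Usage against a running server (not executed here):
#   generate_payload llama3.3:70b 4096 |
#     curl -s http://localhost:11434/api/generate -d @-
```

A smaller `num_ctx` shrinks the KV cache, which is often enough to get a model back under the UMA allocation without changing quantization.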
Slow GPU performance
Causes:
- Not all layers on GPU
- Thermal throttling
- Power management
- HSA_OVERRIDE_GFX_VERSION set incorrectly (or unset on older ROCm)
Solutions:
# Verify GPU layers
# Look for "GPU layers: 99" / "offloaded N layers to GPU" in startup log
# Check temperature and clocks (AMD)
rocm-smi --showtemp --showclocks
# Confirm ROCm backend is actually active in logs
journalctl -u ollama -f | grep -iE 'rocm|hip|gpu'
# Force full GPU offload
-ngl 99
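The startup-log check can be scripted. The sketch below assumes the log contains a line of the form `offloaded N/M layers to GPU`; the exact wording may differ between llama.cpp and Ollama versions, so treat the pattern as an assumption. `offload_status` is a hypothetical helper name.

```shell
# offload_status: read a startup log on stdin and report whether every layer
# was offloaded. Exit status: 0 = full, 1 = partial, 2 = no offload line found.
# Assumes a log line shaped like "offloaded N/M layers to GPU".
offload_status() {
  counts=$(sed -n 's/.*offloaded \([0-9][0-9]*\)\/\([0-9][0-9]*\) layers to GPU.*/\1 \2/p' | tail -n 1)
  if [ -z "$counts" ]; then
    echo "no offload line found (CPU-only run?)"
    return 2
  fi
  set -- $counts            # $1 = layers on GPU, $2 = total layers
  if [ "$1" -eq "$2" ]; then
    echo "full offload: $1/$2 layers on GPU"
  else
    echo "partial offload: $1/$2 layers on GPU"
    return 1
  fi
}

# Usage (not executed here):
#   journalctl -u ollama --no-pager | offload_status
```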
API issues
Connection refused
Solutions:
# Check service is running
systemctl status ollama
docker ps | grep ollama
# Check binding
ss -tuln | grep 11434
# Check firewall
sudo ufw status
# Bind to all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
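Ollama listens on 127.0.0.1 by default, so a server that answers locally can still refuse connections from other machines. As a rough sketch, the hypothetical helper below classifies how a port is bound from `ss -tuln` output; it relies only on the local address column and is not an Ollama feature.

```shell
# listen_scope PORT: read `ss -tuln` output on stdin and print how PORT is bound:
# "loopback", "all-interfaces", "other:<addr>", or "not-listening" (status 1).
# (Hypothetical helper.)
listen_scope() {
  addr=$(awk -v port="$1" '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ (":" port "$")) {     # first field ending in :PORT
          a = $i
          sub(":" port "$", "", a)     # strip the port, keep the address
          print a
          exit
        }
      }
    }')
  case "$addr" in
    "")                  echo "not-listening"; return 1 ;;
    127.*|"[::1]")       echo "loopback" ;;
    0.0.0.0|"*"|"[::]")  echo "all-interfaces" ;;
    *)                   echo "other:$addr" ;;
  esac
}

# Usage (not executed here):
#   ss -tuln | listen_scope 11434
```

A result of `loopback` together with a remote client that cannot connect points at the bind address rather than the firewall.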
Timeout on requests
Causes:
- Model loading slowly
- Large prompt
- Slow hardware
Solutions:
# Increase timeout in client
curl --max-time 300 http://localhost:11434/api/generate
# Pre-load model
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.3:70b", "keep_alive": "1h"}'
# Check model is loaded
ollama ps
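Clients that start right after the service often time out simply because the server is not accepting requests yet. One way to handle this is a small retry loop; the sketch below is a generic helper (`wait_until` is a hypothetical name), and the usage line assumes Ollama's `/api/tags` endpoint on the default port.

```shell
# wait_until TRIES CMD [ARGS...]: run CMD about once per second until it
# succeeds or TRIES attempts have been used. (Hypothetical helper.)
wait_until() {
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}

# Usage (not executed here): wait up to about a minute for the API.
#   wait_until 60 curl -fsS http://localhost:11434/api/tags && ollama ps
```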
Empty or truncated response
Causes:
- max_tokens too low
- Stop token triggered
- Context overflow
Solutions:
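# Increase max_tokens (OpenAI-compatible /v1 endpoints; the native Ollama API
# uses "num_predict" inside "options" instead)
curl ... -d '{"max_tokens": 2000}'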
# Check stop tokens
# Remove or adjust stop sequences
# Check context length
# Reduce prompt size or increase context
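To tell a token-limit cutoff from a stop sequence, it can help to look at the response itself. The sketch below assumes the request went to an OpenAI-compatible chat-completions endpoint whose JSON carries a `finish_reason` field, where `length` conventionally means the token limit was hit and `stop` means a normal end or stop sequence. `finish_reason` as a shell function is a hypothetical helper.

```shell
# finish_reason: read a chat-completions JSON response on stdin and print its
# "finish_reason" value (the last one, if several choices are present).
# A plain sed match keeps the sketch dependency-free; with jq installed,
# `jq -r '.choices[0].finish_reason'` is the sturdier choice.
finish_reason() {
  tr -d '\n' | sed -n 's/.*"finish_reason"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (not executed here):
#   curl -s ... | finish_reason
```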
Docker issues
Container won't start
# Check logs
docker logs ollama
# Check image
docker images | grep ollama
# Pull fresh image (use :rocm on the MS-S1 MAX)
docker pull ollama/ollama:rocm
GPU not available in container
# AMD: Check devices
ls -la /dev/kfd /dev/dri
# Verify permissions
groups # Should include video, render
# Verify Compose file passes them through
docker compose config | grep -A4 -E 'devices|group_add'
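The group check can be made explicit instead of read by eye. A minimal sketch, assuming the required groups are `video` and `render` as stated in this section; `missing_groups` is a hypothetical helper name.

```shell
# missing_groups "REQUIRED..." "ACTUAL...": print each required group that is
# absent from the actual list, one per line; succeed only if none are missing.
# (Hypothetical helper.)
missing_groups() {
  missing=0
  for want in $1; do
    case " $2 " in
      *" $want "*) ;;                  # present, nothing to report
      *) echo "$want"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Usage on the host (not executed here):
#   missing_groups "video render" "$(id -nG)" || echo "user lacks GPU groups"
```

If a group is reported, `sudo usermod -aG video,render <user>` followed by a fresh login is the usual fix on the host; inside containers the `group_add` entries serve the same purpose.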
Volume mount issues
# Check mount
docker exec ollama ls -la /root/.ollama
# Fix permissions
sudo chown -R 1000:1000 /mnt/tank/ai/models/ollama
# SELinux (if applicable)
chcon -Rt svirt_sandbox_file_t /mnt/tank/ai/models/ollama
Performance issues
Slow token generation
Causes:
- CPU fallback (ROCm not actually engaged)
- Memory thrashing
- Suboptimal quantization
Solutions:
# Verify GPU usage
watch -n 1 rocm-smi
# Check for swap usage
free -h
swapon --show # Should be minimal
# Use appropriate quantization
# Q4_K_M is a good balance, not Q2_K
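To put a number on "slow", the final JSON object from Ollama's `/api/generate` reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below only does the arithmetic, tokens per second = eval_count × 10^9 / eval_duration; `tokens_per_sec` is a hypothetical helper name.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS: generation speed in tokens/s,
# printed with one decimal. Prints "n/a" and fails if the duration is not positive.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "n/a"; exit 1 }
    printf "%.1f\n", count * 1e9 / ns
  }'
}

# Example: 256 tokens generated in 32 s (32000000000 ns)
#   tokens_per_sec 256 32000000000    # prints 8.0
```

Comparing this figure before and after a change (quantization, `-ngl`, context size) is more reliable than judging speed by eye.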
High latency (TTFT)
Causes:
- Large context
- Model loading
- Cold start
Solutions:
# Keep model loaded (set in the Ollama server's environment, not the client's)
OLLAMA_KEEP_ALIVE=1h
# Pre-load on start
curl -X POST http://localhost:11434/api/generate \
-d '{"model": "llama3.3:70b", "keep_alive": "24h"}'
# Reduce context
-c 4096
Memory pressure
# Monitor memory
watch -n 1 free -h
# Reduce simultaneous models (server-side environment variable)
OLLAMA_MAX_LOADED_MODELS=1
# Use smaller quantization
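For scripted checks, swap in use can be read from `/proc/meminfo` rather than from the formatted output of `free`. A minimal sketch, assuming the standard Linux `SwapTotal:` and `SwapFree:` lines reported in kB; `swap_used_kib` is a hypothetical helper name.

```shell
# swap_used_kib: read /proc/meminfo-style text on stdin and print the amount
# of swap currently in use, in KiB (SwapTotal minus SwapFree).
swap_used_kib() {
  awk '$1 == "SwapTotal:" { total = $2 }
       $1 == "SwapFree:"  { free = $2 }
       END { print total - free }'
}

# Usage on Linux (not executed here):
#   swap_used_kib < /proc/meminfo
```

A value that keeps growing while a model is generating suggests the model plus context no longer fits in memory.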
Coding tool issues
Tool can't connect
Solutions:
# Set environment
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed
# Verify API responds
curl "$OPENAI_API_BASE/models"
# Check tool-specific config
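A frequent cause is a base URL without the `/v1` suffix, since Ollama serves its OpenAI-compatible API under that path. The sketch below normalizes a URL so it ends in exactly `/v1`; `openai_base` is a hypothetical helper name.

```shell
# openai_base URL: print URL normalized to end in exactly "/v1".
# (Hypothetical helper.)
openai_base() {
  url=${1%/}                 # drop one trailing slash, if any
  case "$url" in
    */v1) echo "$url" ;;
    *)    echo "$url/v1" ;;
  esac
}

# Usage (not executed here):
#   export OPENAI_API_BASE=$(openai_base http://localhost:11434)
```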
Wrong model used
# Specify model explicitly
aider --model ollama/deepseek-coder-v2:16b
# Check available models
ollama list
# In config files, use exact model name
Slow code completion
- Use a faster model for completion (7-8B)
- Use a larger model for complex tasks
- Reduce context sent with requests
Log analysis
Ollama logs
# View logs
journalctl -u ollama -f
# Docker logs
docker logs -f ollama
# Debug mode
OLLAMA_DEBUG=1 ollama serve
llama.cpp logs
# Enable verbose
./llama-server -m model.gguf --verbose
# Log to file
./llama-server -m model.gguf --log-file server.log
System logs
# Check for OOM
dmesg | grep -i "killed process"
# GPU errors (AMD)
dmesg | grep -iE "amdgpu|kfd|hip"
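To see which process the OOM killer actually chose, the matching lines can be reduced to PID and name. The sketch below assumes the usual kernel message shape `Killed process <pid> (<name>)`; `oom_victims` is a hypothetical helper name.

```shell
# oom_victims: read kernel log text on stdin and print "PID NAME" for every
# OOM-kill message of the form "Killed process <pid> (<name>)".
oom_victims() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage (not executed here):
#   sudo dmesg | oom_victims
```

If the inference server or one of its runner processes shows up here, the model and its context did not fit in system memory.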
Recovery steps
Full reset (Ollama, native)
sudo systemctl stop ollama
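# Clear models (optional, large!)
# Note: installs done with the official Linux script typically store models
# under /usr/share/ollama/.ollama/models rather than in your home directory.
rm -rf ~/.ollama/models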
sudo systemctl start ollama
ollama pull llama3.3:70b
Full reset (Docker, ROCm)
docker stop ollama
docker rm ollama
# Optional: clear data
sudo rm -rf /mnt/tank/ai/models/ollama/*
# Start fresh
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
-e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
-v /mnt/tank/ai/models/ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama:rocm
GPU driver reset (AMD)
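# Reload the amdgpu kernel module. This fails while anything still holds the
# GPU (display server, running containers), so stop those first.
sudo modprobe -r amdgpu
sudo modprobe amdgpu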
# If that doesn't unstick the GPU, reboot
sudo reboot
Getting help
Collect diagnostic info
# System info
uname -a
cat /etc/os-release
# GPU info (AMD)
rocminfo | head
rocm-smi
# Memory
free -h
# Docker
docker version
docker info
# Ollama
ollama --version
ollama list
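When collecting this for a bug report, it helps if one missing tool does not abort the whole run. A small sketch: `report` is a hypothetical wrapper that prints a header, then the command's output, or a note if the command is absent or fails.

```shell
# report CMD [ARGS...]: print a header, then the command's output, or a note
# if CMD is not installed or exits non-zero. Always returns success, so a
# missing tool does not stop the collection. (Hypothetical helper.)
report() {
  printf '=== %s ===\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exited with status $?)"
  else
    echo "(not installed: $1)"
  fi
}

# Usage (not executed here): bundle everything into one file.
#   { report uname -a; report rocminfo; report rocm-smi; report free -h;
#     report docker version; report ollama --version; report ollama list; } > diag.txt
```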
Where to ask
See also
- Performance Index - Optimization
- Memory Management - Memory issues
- GPU Containers - Container GPU setup