Ollama Docker¶
Deploy Ollama in Docker with persistent model storage and GPU acceleration on the MS-S1 MAX (AMD ROCm).
Official Images¶
| Image | Backend | Use case |
|---|---|---|
| ollama/ollama:rocm | AMD ROCm | MS-S1 MAX (primary) |
| ollama/ollama | CPU / NVIDIA default | Reference only; not used here |
| ollama/ollama:latest | Same as default | Reference only |
The default ollama/ollama image ships with the CUDA backend, which the MS-S1 MAX cannot use. Always pull the :rocm tag on this host.
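One way to confirm which image a container was actually created from is to ask Docker directly. The sketch below assumes the container is named ollama, as elsewhere on this page; the helper name uses_rocm_image is hypothetical.

```shell
# uses_rocm_image [NAME]
# Prints the image the container was created from and succeeds only if
# the tag contains "rocm". Returns 2 if the container cannot be inspected.
uses_rocm_image() {
  local container="${1:-ollama}"
  local image
  image="$(docker inspect --format '{{.Config.Image}}' "$container")" || return 2
  echo "$image"
  case "$image" in
    *:*rocm*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Calling uses_rocm_image ollama after a deploy gives a quick signal that the CUDA-default image was not pulled by mistake.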
Quick start¶
AMD GPU (ROCm) — MS-S1 MAX¶
docker run -d \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--group-add render \
-v /mnt/tank/ai/models/ollama:/root/.ollama \
-e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
-p 11434:11434 \
--name ollama \
ollama/ollama:rocm
# Pull a model
docker exec ollama ollama pull llama3.3:70b
# Run model
docker exec -it ollama ollama run llama3.3:70b
CPU only (fallback / lightweight workloads)¶
docker run -d \
-v /mnt/tank/ai/models/ollama:/root/.ollama \
-p 11434:11434 \
--name ollama-cpu \
ollama/ollama
Docker Compose¶
Basic AMD ROCm setup¶
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
    ports:
      - "11434:11434"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    restart: unless-stopped
Production configuration¶
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434" # Local only; front with a reverse proxy
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      OLLAMA_HOST: "0.0.0.0"
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "2"
      OLLAMA_KEEP_ALIVE: "30m"
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    deploy:
      resources:
        limits:
          memory: 100G
    healthcheck:
      # Uses the ollama CLI because the image is not expected to include curl
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
Environment variables¶
Configure Ollama behavior via environment variables:
| Variable | Description | Default |
|---|---|---|
| OLLAMA_HOST | Listen address:port | 127.0.0.1:11434 |
| OLLAMA_MODELS | Model storage path | /root/.ollama/models |
| OLLAMA_NUM_PARALLEL | Concurrent requests per model | 1 |
| OLLAMA_MAX_LOADED_MODELS | Models kept in memory at once | 1 |
| OLLAMA_KEEP_ALIVE | How long a model stays loaded after its last request | 5m |
| OLLAMA_DEBUG | Debug logging | false |
| HSA_OVERRIDE_GFX_VERSION | ROCm GPU arch override (11.5.1 for gfx1151 on older ROCm) | unset |
environment:
  OLLAMA_HOST: "0.0.0.0:11434"
  OLLAMA_NUM_PARALLEL: "4"
  OLLAMA_MAX_LOADED_MODELS: "2"
  OLLAMA_KEEP_ALIVE: "1h"
  HSA_OVERRIDE_GFX_VERSION: "11.5.1"
Model management¶
Pull models¶
# From host
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
docker exec ollama ollama pull deepseek-coder-v2:16b
docker exec ollama ollama pull nomic-embed-text
# List models
docker exec ollama ollama list
Pre-load on startup¶
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
      - ./init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        /bin/ollama serve &
        sleep 5
        /init-models.sh
        wait
#!/bin/bash
# init-models.sh
ollama pull llama3.3:70b-instruct-q4_K_M
ollama pull deepseek-coder-v2:16b
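The fixed sleep 5 in the compose command can be too short on a cold start, in which case the pulls in init-models.sh would fail because the server is not yet listening. A minimal sketch of a readiness loop that could go at the top of init-models.sh follows. The function name wait_for_ollama and the 60-attempt default are assumptions, and the loop uses the ollama CLI rather than curl because curl may not be present in the image.

```shell
#!/bin/bash
# wait_for_ollama [MAX_ATTEMPTS]
# Retries "ollama list" once per second until the server answers.
# Returns 0 as soon as a call succeeds, 1 if every attempt fails.
wait_for_ollama() {
  local max_attempts="${1:-60}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if ollama list >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    attempt=$((attempt + 1))
  done
  echo "ollama did not become ready after $max_attempts attempts" >&2
  return 1
}
```

With this in place, the script would call wait_for_ollama before the first ollama pull and exit early if it returns non-zero.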
Import GGUF models¶
# Create Modelfile
cat > /mnt/tank/ai/models/ollama/Modelfile << 'EOF'
FROM /models/gguf/custom-model.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
EOF
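# Mount the GGUF volume and create the model.
# "ollama create" talks to a running server, so start one inside the throwaway container first.
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -v /mnt/tank/ai/models/ollama:/root/.ollama \
  -v /mnt/tank/ai/models/gguf:/models/gguf:ro \
  --entrypoint /bin/bash \
  ollama/ollama:rocm -c \
  'ollama serve & sleep 5; ollama create custom-model -f /root/.ollama/Modelfile'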
With Open WebUI¶
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - /mnt/tank/ai/data/open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped
networks:
  default:
    name: ai-network
API usage¶
Chat completion (OpenAI compatible)¶
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:70b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Docker?"}
]
}'
Native API¶
# Generate
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.3:70b",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Chat
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.3:70b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Model management API¶
# List models
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.3:70b"}'
# Pull model
curl http://localhost:11434/api/pull -d '{"name": "llama3.3:70b"}'
# Delete model (the endpoint requires the DELETE method)
curl -X DELETE http://localhost:11434/api/delete -d '{"name": "old-model"}'
Single-instance pattern (MS-S1 MAX)¶
The MS-S1 MAX has one iGPU sharing the unified-memory pool, so the "two Ollama containers, one per GPU" pattern doesn't apply. Use a single Ollama instance and configure it to keep multiple models warm:
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      OLLAMA_MAX_LOADED_MODELS: "3"
      OLLAMA_KEEP_ALIVE: "1h"
      OLLAMA_NUM_PARALLEL: "2"
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
    ports:
      - "11434:11434"
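Ollama loads and unloads models as requests come in. OLLAMA_KEEP_ALIVE keeps each model resident for that long after its last request, so frequently used models stay warm. A model can still be unloaded earlier if another one needs the memory.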
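Keep-alive can also be set per request, which is one way to warm a specific model ahead of time. The sketch below sends a generate request with no prompt, which loads the model without producing text. The helper name pin_model is hypothetical, and it only accepts a numeric keep_alive (seconds, with -1 meaning indefinitely and 0 meaning unload now).

```shell
# pin_model MODEL [KEEP_ALIVE_SECONDS] [BASE_URL]
# Loads MODEL and sets how long it stays resident.
# KEEP_ALIVE_SECONDS defaults to -1 (keep loaded until told otherwise).
pin_model() {
  local model="$1"
  local keep_alive="${2:--1}"
  local base_url="${3:-http://localhost:11434}"
  curl -fsS "$base_url/api/generate" \
    -d "{\"model\": \"$model\", \"keep_alive\": $keep_alive}"
}
```

For example, pin_model llama3.3:70b would keep the 70B model loaded, and pin_model llama3.3:70b 0 would release it again.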
Monitoring¶
Health checks¶
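A minimal sketch of two host-side checks, assuming the container name ollama and the published port 11434 used on this page. The function names are hypothetical. container_health only works when the service defines a healthcheck, as in the production configuration.

```shell
# ollama_status [BASE_URL]
# Prints "up" if the root endpoint answers, "down" (and returns 1) otherwise.
ollama_status() {
  local base_url="${1:-http://localhost:11434}"
  if curl -fsS "$base_url/" >/dev/null 2>&1; then
    echo "up"
  else
    echo "down"
    return 1
  fi
}

# container_health [NAME]
# Prints Docker's health status for the container (healthy, unhealthy, starting).
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "${1:-ollama}"
}
```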
Resource usage¶
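One possible snapshot of resource usage combines Docker's own counters with the model list Ollama reports. The helper name resource_snapshot is hypothetical; the container name ollama is the one used on this page.

```shell
# resource_snapshot [NAME]
# Prints a one-shot view of container CPU/memory and the currently loaded models.
resource_snapshot() {
  local name="${1:-ollama}"
  echo "== container CPU / memory =="
  docker stats --no-stream "$name"
  echo "== loaded models =="
  docker exec "$name" ollama ps
}
```

On the host, rocm-smi gives the GPU-side view to compare against these numbers.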
Logs¶
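A small sketch for pulling recent problems out of the container log. The helper name recent_errors and the error|warn|fail filter are assumptions; adjust the pattern to whatever the server actually logs on this host.

```shell
# recent_errors [NAME] [SINCE]
# Prints log lines from the last SINCE (default 10m) that mention an error,
# warning, or failure. Returns 1 when nothing matches.
recent_errors() {
  local name="${1:-ollama}"
  local since="${2:-10m}"
  docker logs --since "$since" "$name" 2>&1 | grep -iE 'error|warn|fail'
}
```

To follow the log live instead, docker logs -f --tail 100 ollama streams new lines as they arrive.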
Storage persistence¶
Volume structure¶
/mnt/tank/ai/models/ollama/
|-- models/
|   |-- blobs/          # Model weights (large files)
|   |   `-- sha256-xxx
|   `-- manifests/      # Model metadata
|       `-- registry.ollama.ai/
|           `-- library/
|               `-- llama3.3/
`-- history             # Chat history (optional)
Backup models¶
# Model weights live in the blobs directory; manifests record which models are installed.
# Back up the manifests so the model list can be re-pulled later
tar czf ollama-manifests.tar.gz /mnt/tank/ai/models/ollama/models/manifests
# Full backup, including weights (large)
zfs snapshot tank/ai/models/ollama@backup
Troubleshooting¶
Container won't start¶
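A first pass at diagnosing a container that exits immediately might look like the sketch below. It assumes the container is named ollama; the helper name diagnose_start is hypothetical.

```shell
# diagnose_start [NAME]
# Prints the exit code and error Docker recorded for the container,
# followed by the last lines it logged before stopping.
diagnose_start() {
  local name="${1:-ollama}"
  docker inspect --format 'exit={{.State.ExitCode}} error={{.State.Error}}' "$name"
  docker logs --tail 50 "$name" 2>&1
}
```

If docker run itself fails with a name conflict, an old container called ollama still exists and needs to be removed with docker rm ollama first.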
GPU not available¶
# Host first
rocm-smi
rocminfo | head
# Then in a container
docker run --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
rocm/rocm-terminal rocminfo | head
If rocminfo works on the host but not in the container, the most common cause is missing --device= / --group-add flags.
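Before changing flags, it can help to confirm that the device nodes exist on the host and to look up the numeric group IDs. This is a sketch under assumptions: both helper names are hypothetical, and passing a numeric GID to --group-add is only needed if the group name does not resolve inside the container.

```shell
# check_device PATH
# Succeeds if PATH exists and is a character device (as /dev/kfd should be).
check_device() {
  if [ -c "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# gid_of GROUP [FILE]
# Prints the numeric GID of GROUP from a group file (default /etc/group).
gid_of() {
  awk -F: -v g="$1" '$1 == g { print $3 }' "${2:-/etc/group}"
}
```

Typical use on the host is check_device /dev/kfd, then check_device on the render node under /dev/dri (often renderD128, though the number can differ), and --group-add "$(gid_of render)" in place of the group name.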
Model pull fails¶
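# Check free space on the model volume
df -h /mnt/tank/ai/models/ollama
# Remove leftover partial downloads (the glob must expand inside the container, not on the host)
docker exec ollama sh -c 'rm -f /root/.ollama/models/blobs/*-partial*'
# Retry the pull
docker exec ollama ollama pull llama3.3:70b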
Out of memory¶
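# Reduce concurrent models (set in the container environment, then recreate the container)
OLLAMA_MAX_LOADED_MODELS=1
# Use a smaller quantization
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_S
# Check which models are loaded and how much memory they use
docker exec ollama ollama ps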
Slow response¶
# Verify GPU is being used
docker exec ollama ollama ps
# Should show "GPU" in the PROCESSOR column, not "CPU"
# Confirm the ROCm backend is active in logs
docker logs ollama 2>&1 | grep -iE 'rocm|hip|gpu'
# Increase parallel slots for concurrent requests (set in the container environment)
OLLAMA_NUM_PARALLEL=4
See also¶
- Container Deployment — container overview
- GPU Containers — GPU setup details
- Model Volumes — storage configuration
- Ollama — Ollama reference
- Open WebUI — web interface