GPU Containers

Configure GPU access for containerized LLM inference.

GPU Support Matrix

| Platform | GPU | Container Support | Framework |
|----------|-----|-------------------|-----------|
| Linux | NVIDIA | Excellent | nvidia-container-toolkit |
| Linux | AMD | Good | ROCm |
| Linux | Intel | Experimental | oneAPI |
| macOS | Apple Silicon | None | Use native inference |
| Windows (WSL2) | NVIDIA | Good | nvidia-container-toolkit |
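
Not sure which row applies? A quick host-side check (assuming a typical Linux install) identifies the GPU and which driver stack is present:

# Identify the GPU hardware
lspci | grep -iE 'vga|3d|display'

# NVIDIA driver present?
nvidia-smi --query-gpu=name,driver_version --format=csv 2>/dev/null || echo "no NVIDIA driver"

# ROCm stack present?
rocm-smi --showproductname 2>/dev/null || echo "no ROCm stack"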

NVIDIA Setup

Install Container Toolkit

# Add repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify Installation

# Test GPU access in container
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Docker Compose Configuration

version: '3.8'

services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # or specific number
              capabilities: [gpu]
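
After writing the compose file, a quick sanity check (assuming the service is named ollama as above) is to start the stack and run nvidia-smi inside the container:

docker compose up -d
docker compose exec ollama nvidia-smi   # should list the reserved GPUs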

Specify GPU Devices

# All GPUs
devices:
  - driver: nvidia
    count: all
    capabilities: [gpu]

# Specific number
devices:
  - driver: nvidia
    count: 2
    capabilities: [gpu]

# Specific GPU IDs
devices:
  - driver: nvidia
    device_ids: ['0', '1']
    capabilities: [gpu]

docker run Syntax

# All GPUs
docker run --gpus all ...

# Specific count
docker run --gpus 2 ...

# Specific device
docker run --gpus '"device=0,1"' ...

AMD ROCm Setup

Native Installation

For direct inference without containers, see ROCm Installation. Native installation may provide better performance and simpler debugging for APU configurations.

Install ROCm

# Add repository (Ubuntu 24.04)
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/noble/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm

# Add user to groups
sudo usermod -aG video,render $USER

Verify Installation

# Check ROCm
rocminfo

# Check GPU
rocm-smi
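
Note the reported gfx target (e.g. gfx1030, gfx1100); it matters later if your card is not officially supported and you need HSA_OVERRIDE_GFX_VERSION (see the environment variables reference below):

# Show the GPU architecture target
rocminfo | grep -i gfx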

Container Configuration

version: '3.8'

services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    volumes:
      - /tank/ai/models/ollama:/root/.ollama

docker run Syntax

docker run -d \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -v /tank/ai/models/ollama:/root/.ollama \
  ollama/ollama:rocm
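
For consumer cards that ROCm does not officially support, a common workaround is overriding the reported architecture. The value below is only an example; match it to your card (e.g. 11.0.0 for many RDNA3 GPUs, 10.3.0 for many RDNA2 GPUs):

docker run -d \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -v /tank/ai/models/ollama:/root/.ollama \
  ollama/ollama:rocm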

ROCm with llama.cpp

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    volumes:
      - /tank/ai/models:/models   # adjust to where your GGUF files live
    command: >
      -m /models/llama-3.3-70b-q4_k_m.gguf
      --host 0.0.0.0
      -c 8192
      -ngl 99

Vulkan (Cross-Platform)

For GPUs not well-supported by CUDA or ROCm:

Host Setup

# Install Vulkan
sudo apt install vulkan-tools libvulkan1

# Verify
vulkaninfo

Container Configuration

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri
    group_add:
      - video
      - render

Multi-GPU Configurations

Split Workloads

Assign different models to different GPUs:

version: '3.8'

services:
  chat-model:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

  code-model:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0  # Container sees it as GPU 0
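
To confirm the isolation works, each container should report exactly one GPU (which it sees as index 0):

docker compose exec chat-model nvidia-smi -L   # expect one GPU
docker compose exec code-model nvidia-smi -L   # expect one GPU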

Tensor Parallelism (vLLM)

For models too large for one GPU:

services:
  vllm:
    image: vllm/vllm-openai
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
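
vLLM's OpenAI-compatible server listens on port 8000 by default. Assuming you also publish that port (ports: - "8000:8000") in the service above, a quick smoke test is:

# List the models the server is serving
curl http://localhost:8000/v1/models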

Monitoring GPU Usage

NVIDIA

# Real-time monitoring
nvidia-smi -l 1

# Watch specific metrics
watch -n 1 nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv

# From inside container
docker exec ollama nvidia-smi

AMD

# Real-time monitoring
watch -n 1 rocm-smi

# GPU usage
rocm-smi --showuse

# Memory usage
rocm-smi --showmeminfo vram

Container Stats

# Container CPU and RAM stats (GPU usage is not included here)
docker stats ollama

# GPU utilization in container logs
docker logs ollama 2>&1 | grep -i gpu

Memory Management

GPU Memory Limits

Docker does not provide a hard per-container GPU memory cap; control usage indirectly:

environment:
  - CUDA_VISIBLE_DEVICES=0
  # Ollama doesn't support direct memory limits
  # Use model quantization to control memory
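
What you can control is how many models Ollama keeps resident and for how long. A minimal sketch using docker run (variable names assume a current Ollama release):

# OLLAMA_MAX_LOADED_MODELS - how many models may stay resident at once
# OLLAMA_NUM_PARALLEL      - parallel requests per model (fewer = smaller KV cache)
# OLLAMA_KEEP_ALIVE        - how long an idle model stays loaded
docker run -d --gpus all \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  -e OLLAMA_KEEP_ALIVE=5m \
  -v /tank/ai/models/ollama:/root/.ollama \
  ollama/ollama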

Shared Memory

Multi-GPU setups (e.g. vLLM with tensor parallelism) use /dev/shm for inter-process communication, and Docker's default of 64 MB is often too small:

services:
  llama-server:
    shm_size: '16gb'

Offloading Strategies

When GPU memory is limited:

# Partial GPU offload
llama-server -m model.gguf -ngl 30  # Only 30 layers on GPU

# Ollama adjusts automatically based on available memory
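
To see how a loaded model actually ended up split, Ollama reports a per-model CPU/GPU ratio:

# PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU split
docker exec ollama ollama ps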

Troubleshooting

GPU Not Detected

# NVIDIA: Check driver
nvidia-smi

# Check container toolkit
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

# If fails, reconfigure
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Permission Denied (AMD)

# Add user to groups
sudo usermod -aG video,render $USER

# Log out and back in, or:
newgrp video
newgrp render

# Verify device permissions
ls -la /dev/kfd /dev/dri/*

CUDA Version Mismatch

# Check host driver version
nvidia-smi

# Use a container image whose CUDA version the host driver supports:
# Driver 520+ → CUDA 11.8 images
# Driver 525+ → CUDA 12.0 images
# Driver 535+ → CUDA 12.2 images
docker run --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Out of GPU Memory

# Check current usage
nvidia-smi  # or rocm-smi

# Solutions:
# 1. Use higher quantization (Q4 instead of Q8)
# 2. Reduce context length
# 3. Reduce GPU layers (-ngl)
# 4. Unload unused models
docker exec ollama ollama stop model-name

Container Can't Access GPU

# Verify Docker runtime
docker info | grep -i runtime

# Should show nvidia runtime available
# If not, reinstall nvidia-container-toolkit

# Check GPU passthrough in compose
docker compose config | grep -A5 devices

Environment Variables Reference

NVIDIA

| Variable | Description |
|----------|-------------|
| CUDA_VISIBLE_DEVICES | Limits which GPUs the CUDA runtime uses inside the container |
| NVIDIA_VISIBLE_DEVICES | Controls which GPUs the NVIDIA container runtime exposes to the container |
| NVIDIA_DRIVER_CAPABILITIES | Driver capabilities to expose (e.g. compute,utility) |
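
A minimal example combining these (the image tag is illustrative):

# Expose only GPU 0 and request compute+utility capabilities
docker run --rm --gpus '"device=0"' \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi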

AMD/ROCm

| Variable | Description |
|----------|-------------|
| HIP_VISIBLE_DEVICES | Limit visible GPUs (HIP runtime) |
| ROCR_VISIBLE_DEVICES | Alternative device selection (ROCr runtime) |
| HSA_OVERRIDE_GFX_VERSION | Override the reported GPU architecture |
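
Since --gpus is NVIDIA-specific, ROCm containers select devices through these variables instead. A sketch for pinning the first GPU (device indices assume the host ordering shown by rocm-smi):

docker run -d \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -e HIP_VISIBLE_DEVICES=0 \
  -v /tank/ai/models/ollama:/root/.ollama \
  ollama/ollama:rocm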

See Also