GPU Containers¶

Configure GPU access for containerized LLM inference on the MS-S1 MAX (AMD Strix Halo, ROCm) and — for cross-reference — Apple Silicon laptops.

GPU Support Matrix¶

Platform	GPU	Container support	Toolchain
MS-S1 MAX (Linux)	AMD Strix Halo iGPU (gfx1151)	Yes — `/dev/kfd` + `/dev/dri` passthrough	ROCm 7.x
Linux	AMD discrete (RDNA 3/CDNA)	Yes	ROCm 7.x
macOS	Apple Silicon	None — Docker Desktop does not expose Metal	Run natively (MLX, llama.cpp Metal)

Not used on the MS-S1 MAX: NVIDIA / nvidia-container-toolkit / CUDA images. The Strix Halo iGPU is an AMD device; this whole site assumes a CUDA-free stack.

AMD ROCm setup (this build)¶

Native vs container

For direct inference without containers, see ROCm Installation. For an APU like the MS-S1 MAX, native installation can simplify debugging; containers buy you reproducibility and isolation.

Install ROCm on the host¶

# Ubuntu 26.04: ROCm 7.1 ships in Universe
sudo apt update
sudo apt install rocm

# Grant container processes access to the GPU
sudo usermod -aG video,render $USER

For ROCm newer than what is in the Ubuntu archive, install AMD's amdgpu-install from repo.radeon.com. See ROCm Installation for the upstream path.

Verify the host can talk to the GPU¶

rocminfo | head
rocm-smi
ls -l /dev/kfd /dev/dri

You should see gfx1151 in rocminfo output and the kfd + dri devices on disk.

Docker Compose configuration¶

services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"  # only needed for older ROCm
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama

`docker run` syntax¶

docker run -d \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  -v /mnt/tank/ai/models/ollama:/root/.ollama \
  ollama/ollama:rocm

ROCm with llama.cpp¶

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      HSA_OVERRIDE_GFX_VERSION: "11.5.1"
    command: >
      -m /models/llama-3.3-70b-q4_k_m.gguf
      --host 0.0.0.0
      -c 8192
      -ngl 99

Vulkan (fallback)¶

For GPUs where the ROCm runtime is not ready (or you want a portable build), llama.cpp also ships a Vulkan backend.

Host setup¶

sudo apt install vulkan-tools libvulkan1
vulkaninfo | head

Container configuration¶

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    devices:
      - /dev/dri
    group_add:
      - video
      - render

Vulkan is generally slower than ROCm on the Strix Halo iGPU, but is useful as a sanity check if a ROCm image misbehaves.

Memory management¶

The MS-S1 MAX has a single iGPU sharing the system memory pool, so container "GPU memory limits" don't apply the way they do on a discrete-GPU box. Instead:

Choose quantization that fits comfortably (Q4_K_M for 70B, etc.).
Use the BIOS UMA frame buffer setting to give ROCm enough headroom — see Memory Configuration.
Only run one inference engine at a time unless you have explicit reason to share.

Shared memory¶

Some workloads (vLLM, tensor-parallel runs on multi-GPU rigs) need larger shm:

services:
  llama-server:
    shm_size: '16gb'

Offloading strategies¶

If a model is too big even at Q4:

# Partial GPU offload — keep some layers on CPU
llama-server -m model.gguf -ngl 30

# Ollama adjusts automatically based on available memory

Monitoring GPU usage¶

From the host¶

# Real-time monitoring (AMD ROCm)
watch -n 1 rocm-smi

# GPU utilization
rocm-smi --showuse

# VRAM (unified memory carved out for the GPU)
rocm-smi --showmeminfo vram

From inside a container¶

docker exec ollama rocm-smi
docker stats ollama
docker logs ollama 2>&1 | grep -iE 'rocm|hip|gpu'

Troubleshooting¶

GPU not detected in the container¶

# Host first: do you see the GPU at all?
rocminfo | head
ls -l /dev/kfd /dev/dri

# Then in a clean container
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  rocm/rocm-terminal:latest rocminfo | head

If the host sees the GPU but the container doesn't, the most common cause is missing --device= / --group-add flags.

Permission denied (`/dev/kfd` or `/dev/dri/renderD*`)¶

# Add user to groups (one-time)
sudo usermod -aG video,render $USER

# New shell to pick up the groups
newgrp video
newgrp render

# Verify device permissions
ls -la /dev/kfd /dev/dri/*

Out of GPU memory¶

# Check what's currently using the GPU
rocm-smi

# Solutions:
# 1. Use higher quantization (Q4 instead of Q8)
# 2. Reduce context length
# 3. Reduce GPU layers (-ngl)
# 4. Unload unused models
docker exec ollama ollama stop model-name

`HSA_OVERRIDE_GFX_VERSION` confusion¶

On older ROCm (6.x) the Strix Halo iGPU required HSA_OVERRIDE_GFX_VERSION=11.5.1 because the runtime didn't recognise gfx1151 by default. ROCm 7.x supports gfx1151 natively, so the override is no longer required — but setting it does no harm and lets the same Compose file work on both ROCm 6 and 7.

Environment variables reference¶

AMD / ROCm¶

Variable	Description
`HIP_VISIBLE_DEVICES`	Limit which GPUs HIP sees
`ROCR_VISIBLE_DEVICES`	Alternative device selection (HSA runtime)
`HSA_OVERRIDE_GFX_VERSION`	Override GPU architecture (e.g. `11.5.1` for gfx1151 on older ROCm)
`GPU_MAX_HW_QUEUES`	Bound on hardware queue count, useful for tuning