
Container Deployment

Deploy LLM inference engines in Docker containers with persistent model storage.

Overview

Containerized LLM deployment provides:

  • Isolation - Separate engines, models, and configurations
  • Reproducibility - Consistent environments across systems
  • Easy updates - Pull new images without affecting data
  • ZFS integration - Models stored on persistent datasets

Container Strategy

┌─────────────────────────────────────────────────────────────────┐
│                    Docker/Podman Host                           │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Ollama    │  │ llama.cpp   │  │      Open WebUI         │  │
│  │  Container  │  │  Container  │  │       Container         │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
│              OpenAI-Compatible APIs                             │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                    Volume Mounts                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 /tank/ai/models                            │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐    │  │
│  │  │   ollama/   │  │    gguf/    │  │  huggingface/   │    │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘    │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

GPU Access in Containers

NVIDIA GPUs

Excellent container support via nvidia-container-toolkit:

services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
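
To confirm that a container can actually see the GPU before starting any engine, run a quick sanity check (the toolkit injects nvidia-smi from the host driver, so a plain Ubuntu image is enough):

# Verify GPU visibility from inside a container
docker run --rm --gpus all ubuntu nvidia-smi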

AMD GPUs (ROCm)

Good support on Linux with ROCm:

services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
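
A quick host-side check before and after starting the container (exact log wording varies by Ollama version):

# The ROCm device nodes must exist and be accessible to the container
ls -l /dev/kfd /dev/dri
# Ollama reports which GPU backend it detected at startup
docker logs ollama 2>&1 | grep -iE 'rocm|amdgpu|gpu'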

See GPU Containers for detailed GPU setup.

Container Comparison

| Feature          | Ollama      | llama.cpp           | LocalAI     |
|------------------|-------------|---------------------|-------------|
| Setup complexity | Easy        | Medium              | Medium      |
| Model management | Built-in    | Manual              | Manual      |
| GPU support      | NVIDIA, AMD | NVIDIA, AMD, Vulkan | NVIDIA, AMD |
| OpenAI API       | Yes         | Yes                 | Yes         |
| Multi-model      | Yes         | Per-instance        | Yes         |
| Official image   | Yes         | Yes                 | Yes         |

Quick Start

Ollama

# Start Ollama container
docker run -d \
  --gpus all \
  -v /tank/ai/models/ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3.3
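
Once the model is pulled, the API can be exercised directly from the host (model name and prompt here are just examples):

# Quick API check against the published port
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'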

llama.cpp

# Start llama.cpp server
docker run -d \
  --gpus all \
  -v /tank/ai/models/gguf:/models \
  -p 8080:8080 \
  --name llama-server \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/llama-3.3-70b-q4_k_m.gguf \
  -c 8192 -ngl 99
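
The server exposes an OpenAI-compatible API on port 8080; a minimal check (request body values are illustrative):

# Chat completion against the loaded model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'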

Storage Layout

ZFS Dataset Structure

# Create AI models dataset
zfs create tank/ai
zfs create -o recordsize=1M tank/ai/models
zfs create tank/ai/models/ollama
zfs create tank/ai/models/gguf
zfs create tank/ai/models/huggingface
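
To verify the layout and confirm the 1M recordsize was applied and inherited:

# Inspect the dataset tree and the recordsize property
zfs list -r -o name,used,mountpoint tank/ai
zfs get -r recordsize tank/ai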

Volume Mounts

volumes:
  - /tank/ai/models/ollama:/root/.ollama
  - /tank/ai/models/gguf:/models:ro
  - /tank/ai/models/huggingface:/root/.cache/huggingface:ro

Network Configuration

Internal Network

networks:
  ai-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/24

services:
  ollama:
    networks:
      - ai-net
  webui:
    networks:
      - ai-net
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
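
Within ai-net, services resolve each other by name. For a quick connectivity check, attach a throwaway container to the network; note that Compose normally prefixes the network with the project name, so <project> below is a placeholder:

# Replace <project> with your compose project name (see `docker network ls`)
docker run --rm --network <project>_ai-net curlimages/curl \
  http://ollama:11434/api/version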

External Access

services:
  ollama:
    ports:
      - "127.0.0.1:11434:11434"  # Local only
      # Or
      - "11434:11434"  # Network access (use with firewall)

Topics

  • llama.cpp Docker


    Official llama.cpp container with GPU support

    Setup guide

  • Ollama Docker


    Ollama container with model persistence

    Setup guide

  • Model Volumes


    ZFS dataset configuration for model storage

    Storage setup

  • GPU Containers


    GPU passthrough for NVIDIA and AMD

    GPU setup

Common Patterns

Development Stack

version: '3.8'

services:
  ollama:
    image: ollama/ollama
    volumes:
      - /tank/ai/models/ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - /tank/ai/data/webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
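
Bring the stack up and check the logs; Open WebUI is then available at http://localhost:3000 and the Ollama API at http://localhost:11434:

# Start the stack and follow logs
docker compose up -d
docker compose logs -f ollama webui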

Multi-Backend

services:
  ollama:
    # Fast model switching
    image: ollama/ollama
    # ...

  llama-server:
    # Specific model, fine-tuned config
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    # ...

  traefik:
    # Route requests to appropriate backend
    image: traefik:v3.0
    # ...
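
One way to fill in the Traefik piece is its Docker provider with per-service labels; the sketch below is illustrative only, and the hostnames, router names, and entrypoint are placeholders to adapt:

services:
  traefik:
    image: traefik:v3.0
    command:
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
      - --entryPoints.web.address=:80
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  ollama:
    labels:
      - traefik.enable=true
      - traefik.http.routers.ollama.rule=Host(`ollama.lan`)
      - traefik.http.services.ollama.loadbalancer.server.port=11434

  llama-server:
    labels:
      - traefik.enable=true
      - traefik.http.routers.llama.rule=Host(`llama.lan`)
      - traefik.http.services.llama.loadbalancer.server.port=8080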

See Also