Architecture Decisions

Choose the right deployment approach for your local LLM infrastructure.

Decision Tree

┌─────────────────────────────────────────────────────────────┐
│                 Where to run LLM inference?                  │
└─────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              v               v               v
        ┌─────────┐     ┌─────────┐     ┌─────────┐
        │  Native │     │Container│     │   VM    │
        └────┬────┘     └────┬────┘     └────┬────┘
             │               │               │
             v               v               v
    ┌────────────────┐ ┌────────────┐ ┌────────────────┐
    │ Best for:      │ │ Best for:  │ │ Best for:      │
    │ - macOS/MLX    │ │ - Linux    │ │ - Windows apps │
    │ - Max perf     │ │ - Services │ │ - GPU passthru │
    │ - GUI tools    │ │ - Multi-   │ │ - Isolation    │
    │                │ │   instance │ │                │
    └────────────────┘ └────────────┘ └────────────────┘

Comparison Matrix

Factor	Native	Container	VM
Performance	Best	Good (roughly 5-10% overhead)	Good (with passthrough)
Isolation	None	Process-level	Full
GPU Access	Direct	Varies by platform	Passthrough required
Setup complexity	Low	Medium	High
Portability	Low	High	Medium
macOS support	Excellent	Limited GPU	No GPU passthrough
Linux support	Good	Excellent	Good
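
The tree and matrix above can be condensed into a small lookup. The sketch below is a simplification, not a complete decision procedure: the function name `recommend_deployment` and the requirement keywords (`performance`, `services`, `isolation`, `windows-tools`) are hypothetical, and the Linux default follows this document's container-first recommendation for the MS-S1 MAX.

```shell
#!/bin/sh
# Hypothetical helper: condense the decision tree into a lookup.
# Usage: recommend_deployment <os> [need]
#   os   - output of `uname -s` (Darwin, Linux, ...)
#   need - performance | services | isolation | windows-tools (assumed keywords)
recommend_deployment() {
  case "$1:${2:-services}" in
    *:windows-tools) echo vm ;;         # Windows-only apps need a Windows guest
    Darwin:*)        echo native ;;     # no Metal in containers, no GPU passthrough to VMs
    Linux:isolation) echo vm ;;         # full isolation, GPU passthrough required
    Linux:*)         echo container ;;  # services, multi-instance, ROCm device passthrough
    *)               echo native ;;     # unknown platform: start with the simplest setup
  esac
}
```

For example, `recommend_deployment "$(uname -s)" services` prints the suggested approach for the current machine.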

Native Installation

Run inference engines directly on the host OS.

When to Choose Native

macOS with Apple Silicon - MLX and Metal require native access
Maximum performance - No virtualization overhead
GUI applications - LM Studio, Jan.ai
Development/testing - Quick iteration

Engines for Native

Engine	macOS	Linux	Notes
MLX	Excellent	N/A	Apple Silicon only
llama.cpp	Good (Metal)	Good (ROCm/HIP on gfx1151)	Cross-platform
Ollama	Good	Good (ROCm build)	Docker-like UX
LM Studio	Excellent	Good	GUI
Jan.ai	Good	Good	GUI, offline-first

Native Example (macOS)

# Install MLX
pip install mlx-lm

# Or install Ollama natively
brew install ollama
ollama serve   # runs in the foreground; leave it running

# In a second terminal, pull a model
ollama pull llama3.3:70b-instruct-q4_K_M
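
Once the server is running and the model is pulled, a single request can confirm that the native install works end to end. The sketch below assumes Ollama's default port 11434 and its `/api/generate` endpoint. The helper name `ollama_payload` is hypothetical, and it does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: build a non-streaming /api/generate request body.
# No JSON escaping is done, so keep quotes and backslashes out of the prompt.
ollama_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

# Send the request (commented out so the helper can be sourced without a running server):
# curl -s http://localhost:11434/api/generate \
#   -d "$(ollama_payload llama3.3:70b-instruct-q4_K_M 'Say hello in one sentence.')"
```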

Container Deployment

Run inference engines in Docker/Podman containers.

When to Choose Containers

Linux servers - ROCm device passthrough for GPU (the MS-S1 MAX path)
Service isolation - Separate models/configs
Reproducibility - Consistent deployments
Multi-tenant - Different users/applications
Orchestration - Compose, Kubernetes

Container GPU Access

Platform	GPU Access	Setup
Linux + AMD (MS-S1 MAX)	Good	`/dev/kfd` + `/dev/dri` passthrough, `video`/`render` groups
macOS	Limited	No Metal passthrough — run engines natively
Linux + NVIDIA	n/a here	Reference only — not used on this build

Container Example (Linux + AMD ROCm, MS-S1 MAX)

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:rocm
    volumes:
      - /mnt/tank/ai/models/ollama:/root/.ollama
    ports:
      - "11434:11434"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.5.1  # gfx1151 (Strix Halo)
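
The compose file passes `/dev/kfd` and `/dev/dri` into the container, so it can help to confirm that both exist on the host before starting the service. The following is a minimal preflight sketch; the function name `check_rocm_devices` and its optional root-prefix argument (used only to make it testable) are hypothetical.

```shell
#!/bin/sh
# Hypothetical preflight check for the ROCm device nodes the compose file passes through.
# Usage: check_rocm_devices [root]   (root is an optional path prefix, empty by default)
check_rocm_devices() {
  root="${1:-}"
  status=0
  for node in /dev/kfd /dev/dri; do
    if [ -e "${root}${node}" ]; then
      echo "ok: ${node}"
    else
      echo "missing: ${node}" >&2
      status=1
    fi
  done
  return "$status"
}
```

A typical use would be `check_rocm_devices && docker compose up -d`, so the service only starts when both device nodes are present.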

macOS Container Limitations

Containers on macOS cannot access the Metal GPU, because Docker Desktop runs them inside a Linux VM with no Metal passthrough:

┌──────────────────────────────────────────────┐
│                macOS Host                     │
│  ┌────────────────┐  ┌────────────────────┐  │
│  │  Native Apps   │  │  Docker Desktop    │  │
│  │  (Metal GPU)   │  │  (CPU only)        │  │
│  │  ┌──────────┐  │  │  ┌──────────────┐  │  │
│  │  │ LM Studio│  │  │  │   Ollama     │  │  │
│  │  │ MLX      │  │  │  │ (no Metal)   │  │  │
│  │  └──────────┘  │  │  └──────────────┘  │  │
│  └────────────────┘  └────────────────────┘  │
└──────────────────────────────────────────────┘

For macOS: Use native Ollama or LM Studio, not containerized versions.

Virtual Machine

Run LLMs inside a full virtual machine.

When to Choose VMs

Windows applications - Windows-only tools, or tools such as LM Studio that run well on Windows
Strong isolation - Security boundaries
Testing different OSes - Linux distros, Windows

GPU passthrough on the MS-S1 MAX

The Strix Halo iGPU is shared with the host, including its display path, so full PCIe passthrough to a VM is not the recommended deployment. Run inference engines in containers on the host with /dev/kfd + /dev/dri passthrough instead.

VM GPU Passthrough

See GPU Passthrough for detailed setup.

┌───────────────────────────────────────────────────────┐
│                     Host OS (Linux)                    │
│  ┌─────────────────────────────────────────────────┐  │
│  │                QEMU/KVM                          │  │
│  │  ┌───────────────────────────────────────────┐  │  │
│  │  │           Windows 11 VM                    │  │  │
│  │  │  ┌─────────────────────────────────────┐  │  │  │
│  │  │  │       LM Studio + GPU               │  │  │  │
│  │  │  │    (OpenAI-compatible API)          │  │  │  │
│  │  │  └─────────────────────────────────────┘  │  │  │
│  │  └───────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────┘  │
│                          │                             │
│                   API accessible                       │
│              (host, containers, network)               │
└───────────────────────────────────────────────────────┘

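VM Example

# Expose the LM Studio API from the Windows VM
# In the VM: LM Studio -> Local Server -> Start
# The server must listen on the VM's network interface, not only on localhost
# API available at http://<vm-ip>:1234/v1/

# From the host or a container (192.168.122.10 is an example VM address):
curl http://192.168.122.10:1234/v1/models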
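
Because the VM exposes an OpenAI-compatible API, many clients can be pointed at it through environment variables rather than code changes. The sketch below assumes a client that honors `OPENAI_BASE_URL` and `OPENAI_API_KEY` (the official OpenAI SDKs read both; other tools may use different names). The function name `use_local_endpoint` and the placeholder key value are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: point OpenAI-compatible clients at a local server.
# Usage: use_local_endpoint <host> [port]   (port defaults to 1234, as in the example above)
use_local_endpoint() {
  OPENAI_BASE_URL="http://$1:${2:-1234}/v1"
  # Assumption: the local server does not validate the key, but many clients
  # refuse to start unless one is set.
  OPENAI_API_KEY="local-placeholder"
  export OPENAI_BASE_URL OPENAI_API_KEY
}
```

Calling `use_local_endpoint 192.168.122.10` in a shell session would direct such clients to the VM for the rest of that session.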

Hybrid Approaches

Combine approaches for flexibility:

Development Setup

Native (daily use):
├── LM Studio (GUI, model testing)
└── Ollama (CLI, API)

Container (services):
├── Open WebUI (web interface)
└── LocalAI (API gateway)

Production Setup

Container (primary):
├── Ollama (main inference)
├── llama.cpp (specific models)
└── Traefik (load balancing)

Native (fallback):
└── MLX (macOS-specific workloads)
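
A primary-plus-fallback layout implies some way of choosing which endpoint to call. One possible approach, sketched below, is to probe health-check URLs in priority order and use the first that responds. The function names `probe` and `pick_endpoint` are hypothetical, and the URLs passed in are whatever health or model-listing paths your engines expose.

```shell
#!/bin/sh
# probe: succeeds when the URL answers without an HTTP error (status below 400)
# within 2 seconds.
probe() {
  curl -fsS --max-time 2 -o /dev/null "$1" 2>/dev/null
}

# pick_endpoint: print the first URL that passes probe, in priority order.
# Returns non-zero when no endpoint is reachable.
pick_endpoint() {
  for url in "$@"; do
    if probe "$url"; then
      echo "$url"
      return 0
    fi
  done
  return 1
}
```

For instance, `pick_endpoint http://localhost:11434/api/tags <native-fallback-url>` would prefer the containerized Ollama and fall back to the native engine only when the container does not answer.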

Recommendations by Use Case

AI-Assisted Coding

Scenario	Recommendation
macOS daily driver	Native Ollama + LM Studio
Linux server	Containerized Ollama
Mixed fleet	Container API + native clients

Multi-User Service

Requirement	Solution
Web interface	Open WebUI container
API access	Ollama/llama.cpp container
Authentication	Open WebUI or reverse proxy

Maximum Performance

Platform	Solution
Apple Silicon (laptops)	Native MLX
AMD Strix Halo (MS-S1 MAX)	Container llama.cpp built with HIP for `gfx1151`
Multi-GPU datacenter (not this build)	vLLM, reference only
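
For the Strix Halo row, a HIP build of llama.cpp targeting `gfx1151` might be configured roughly as follows. This is a sketch under assumptions: the flag names (`GGML_HIP`, `AMDGPU_TARGETS`) are taken from llama.cpp's build documentation and have changed between versions, so check the docs for the release you build. The helper name `llama_hip_cmake_args` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print CMake flags for a HIP build of llama.cpp.
# Flag names follow llama.cpp's build docs and may differ in other versions.
# Usage: llama_hip_cmake_args [gpu-target]   (defaults to gfx1151)
llama_hip_cmake_args() {
  printf '%s\n' "-DGGML_HIP=ON -DAMDGPU_TARGETS=${1:-gfx1151} -DCMAKE_BUILD_TYPE=Release"
}

# Inside a llama.cpp checkout with ROCm installed (commented out; needs the toolchain):
# HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
#   cmake -S . -B build $(llama_hip_cmake_args gfx1151)
# cmake --build build --config Release -j
```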

Migration Paths

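Native to Container

# List the models installed natively and note their names
ollama list

# Re-pull each model inside the container
# (assumes the container is named "ollama"; with Compose, use "docker compose exec ollama ollama pull <model>")
docker exec ollama ollama pull <model>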
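
With more than a few models, the note-and-re-pull step can be scripted. The sketch below assumes that `ollama list` prints a header line followed by rows whose first column is the model name, and that the container is named `ollama`. The helper name `model_names` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ollama list` output on stdin, print one model name per line.
# Assumes the first line is a header and the first column is the model name.
model_names() {
  awk 'NR > 1 && NF { print $1 }'
}

# Re-pull every natively installed model in the container
# (commented out; needs Ollama and Docker, and a container named "ollama"):
# for m in $(ollama list | model_names); do
#   docker exec ollama ollama pull "$m"
# done
```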

Container to Container

# Models stored on the ZFS volume are portable:
# mount the same volume in the new container
volumes:
  - /mnt/tank/ai/models/ollama:/root/.ollama