Architecture Decisions¶
Choose the right deployment approach for your local LLM infrastructure.
Decision Tree¶
┌─────────────────────────────────────────────────────────────┐
│                 Where to run LLM inference?                  │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
         ┌─────────┐      ┌─────────┐      ┌─────────┐
         │ Native  │      │Container│      │   VM    │
         └────┬────┘      └────┬────┘      └────┬────┘
              │                │                │
              ▼                ▼                ▼
      ┌────────────────┐ ┌────────────┐ ┌────────────────┐
      │ Best for:      │ │ Best for:  │ │ Best for:      │
      │ - macOS/MLX    │ │ - Linux    │ │ - Windows apps │
      │ - Max perf     │ │ - Services │ │ - GPU passthru │
      │ - GUI tools    │ │ - Multi-   │ │ - Isolation    │
      │                │ │   instance │ │                │
      └────────────────┘ └────────────┘ └────────────────┘
Comparison Matrix¶
| Factor | Native | Container | VM |
|---|---|---|---|
| Performance | Best | Good (~5-10% overhead) | Good (with passthrough) |
| Isolation | None | Process-level | Full |
| GPU Access | Direct | Varies by platform | Passthrough required |
| Setup complexity | Low | Medium | High |
| Portability | Low | High | Medium |
| macOS support | Excellent | Limited GPU | No GPU passthrough |
| Linux support | Good | Excellent | Good |
Native Installation¶
Run inference engines directly on the host OS.
When to Choose Native¶
- macOS with Apple Silicon - MLX and Metal require native access
- Maximum performance - No virtualization overhead
- GUI applications - LM Studio, Jan.ai
- Development/testing - Quick iteration
Engines for Native¶
| Engine | macOS | Linux | Notes |
|---|---|---|---|
| MLX | Excellent | N/A | Apple Silicon only |
| llama.cpp | Good (Metal) | Good (CUDA) | Cross-platform |
| Ollama | Good | Good | Docker-like UX |
| LM Studio | Excellent | Good | GUI |
| Jan.ai | Good | Good | GUI, offline-first |
Native Example (macOS)¶
# Install MLX
pip install mlx-lm
# Or install Ollama natively
brew install ollama
ollama serve
ollama pull llama3.3:70b-instruct-q4_K_M
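Once mlx-lm is installed, its command-line entry points can be used to smoke-test a model or expose an OpenAI-compatible server. A minimal sketch; the model name is illustrative (any MLX-format model, e.g. from the Hugging Face mlx-community organization, should work):
# Quick test with mlx-lm (model name is illustrative)
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Write a haiku about local inference"
# Optional: serve an OpenAI-compatible API on port 8080
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080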
Container Deployment¶
Run inference engines in Docker/Podman containers.
When to Choose Containers¶
- Linux servers - NVIDIA Container Toolkit for GPU
- Service isolation - Separate models/configs
- Reproducibility - Consistent deployments
- Multi-tenant - Different users/applications
- Orchestration - Compose, Kubernetes
Container GPU Access¶
| Platform | GPU Access | Setup |
|---|---|---|
| Linux + NVIDIA | Excellent | nvidia-container-toolkit |
| Linux + AMD | Good | ROCm containers |
| macOS | Limited | No Metal passthrough |
| Windows + WSL2 | Good | CUDA support in WSL |
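Before deploying services, confirm the container runtime can actually see the GPU. A minimal sketch for Linux + NVIDIA, assuming nvidia-container-toolkit is already installed from NVIDIA's repository; the CUDA image tag is illustrative:
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Sanity check: the container should report the same GPUs as the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi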
Container Example (Linux)¶
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - /tank/ai/models/ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
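With the compose file above, bring-up looks roughly like this. The container name ollama is an assumption (set container_name: ollama, or use the name Compose assigns, e.g. <project>-ollama-1); the model tag is illustrative:
# Start the stack, load a model, and verify the API from the host
docker compose up -d
docker exec -it ollama ollama pull llama3.1:8b
curl http://localhost:11434/api/tags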
macOS Container Limitations¶
Containers on macOS cannot access Metal GPU:
┌──────────────────────────────────────────────┐
│                  macOS Host                  │
│  ┌────────────────┐  ┌────────────────────┐  │
│  │  Native Apps   │  │   Docker Desktop   │  │
│  │  (Metal GPU)   │  │     (CPU only)     │  │
│  │  ┌──────────┐  │  │  ┌──────────────┐  │  │
│  │  │ LM Studio│  │  │  │    Ollama    │  │  │
│  │  │   MLX    │  │  │  │  (no Metal)  │  │  │
│  │  └──────────┘  │  │  └──────────────┘  │  │
│  └────────────────┘  └────────────────────┘  │
└──────────────────────────────────────────────┘
For macOS: use native Ollama or LM Studio rather than containerized versions.
Virtual Machine¶
Run LLMs inside a full virtual machine.
When to Choose VMs¶
- Windows-only tools - LM Studio has good Windows support
- GPU passthrough - Dedicate GPU to VM
- Strong isolation - Security boundaries
- Testing different OSes - Linux distros, Windows
VM GPU Passthrough¶
See GPU Passthrough for detailed setup.
┌───────────────────────────────────────────────────────┐
│                    Host OS (Linux)                    │
│  ┌─────────────────────────────────────────────────┐  │
│  │                    QEMU/KVM                     │  │
│  │  ┌───────────────────────────────────────────┐  │  │
│  │  │               Windows 11 VM               │  │  │
│  │  │  ┌─────────────────────────────────────┐  │  │  │
│  │  │  │           LM Studio + GPU           │  │  │  │
│  │  │  │       (OpenAI-compatible API)       │  │  │  │
│  │  │  └─────────────────────────────────────┘  │  │  │
│  │  └───────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────┘  │
│                           │                           │
│                    API accessible                     │
│             (host, containers, network)               │
└───────────────────────────────────────────────────────┘
VM Example¶
# Expose LM Studio API from Windows VM
# In VM: LM Studio → Local Server → Start
# API available at http://vm-ip:1234/v1/
# From host or container:
curl http://192.168.122.10:1234/v1/models
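Because the endpoint is OpenAI-compatible, any OpenAI-style client can target it. A sketch of a chat completion request; the model value is whatever LM Studio currently has loaded, shown here as a placeholder:
# Chat completion against the VM's LM Studio server (placeholder model name)
curl http://192.168.122.10:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-model", "messages": [{"role": "user", "content": "Hello"}]}'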
Hybrid Approaches¶
Combine approaches for flexibility:
Development Setup¶
Native (daily use):
├── LM Studio (GUI, model testing)
└── Ollama (CLI, API)
Container (services):
├── Open WebUI (web interface)
└── LocalAI (API gateway)
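One way to wire the container layer to the native layer is to point a containerized Open WebUI at the host's native Ollama. A sketch, following the Open WebUI image and OLLAMA_BASE_URL variable from its documentation; the host-gateway mapping is needed for host.docker.internal on Linux Docker:
# Containerized Open WebUI talking to a native Ollama on the host
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main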
Production Setup¶
Container (primary):
├── Ollama (main inference)
├── llama.cpp (specific models)
└── Traefik (load balancing)
Native (fallback):
└── MLX (macOS-specific workloads)
Recommendations by Use Case¶
AI-Assisted Coding¶
| Scenario | Recommendation |
|---|---|
| macOS daily driver | Native Ollama + LM Studio |
| Linux server | Containerized Ollama |
| Mixed fleet | Container API + native clients |
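For the mixed-fleet case, native clients can use a central container API by overriding the Ollama host. A sketch; the server address is illustrative:
# Point a local ollama CLI (or any tool honoring OLLAMA_HOST) at a remote server
export OLLAMA_HOST=192.168.1.50:11434
ollama list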
Multi-User Service¶
| Requirement | Solution |
|---|---|
| Web interface | Open WebUI container |
| API access | Ollama/llama.cpp container |
| Authentication | Open WebUI or reverse proxy |
Maximum Performance¶
| Platform | Solution |
|---|---|
| Apple Silicon | Native MLX |
| NVIDIA GPU | Native or container llama.cpp |
| Multi-GPU | vLLM container |
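For the multi-GPU case, a containerized vLLM OpenAI-compatible server could be launched roughly as below. The model, tensor-parallel size, and cache path are illustrative; the cache path follows the /tank/ai/models layout used above:
# vLLM OpenAI-compatible server spread across two GPUs (values illustrative)
docker run -d --gpus all --ipc=host -p 8000:8000 \
  -v /tank/ai/models/hf:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2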
Migration Paths¶
Native to Container¶
# List models on the native install
ollama list  # note the model names
# Pull the same models inside the container
docker exec ollama ollama pull <model>
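Re-pulling each model by hand can be scripted from the list output. A sketch, assuming the container is actually named ollama:
# Re-pull every model from the native install into the container
# (skips the header line of `ollama list`)
ollama list | awk 'NR>1 {print $1}' | xargs -n1 docker exec ollama ollama pull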
Container to Container¶
# Models stored on a ZFS volume are portable:
# just mount the same volume in the new container.
volumes:
  - /tank/ai/models/ollama:/root/.ollama
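If the model store needs to move to a different host, ZFS replication keeps it intact. A sketch; the dataset name and target host are assumptions based on the paths used above:
# Snapshot and replicate the model dataset to another host
zfs snapshot tank/ai/models/ollama@migrate
zfs send tank/ai/models/ollama@migrate | ssh newhost zfs receive -u tank/ai/models/ollama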
See Also¶
- Container Deployment - Docker/Podman setup
- VM Integration - GPU passthrough VMs
- Inference Engines - Engine comparison
- GPU Passthrough - VM GPU setup