AI & Local LLMs¶

Run large language models locally on a 128GB unified memory system for privacy, cost savings, and low latency.

Why Local LLMs?¶

Benefit	Description
Privacy	Data never leaves your machine
Cost	No API fees after hardware investment
Latency	Sub-100ms first token for local inference
Offline	Works without internet connection
Control	Choose models, tune parameters, no rate limits

Architecture Overview¶

┌─────────────────────────────────────────────────────────────────┐
│                        Client Layer                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Claude Code │  │   Aider     │  │  Cline / Continue.dev   │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         │                │                      │                │
│         └────────────────┴──────────────────────┘                │
│                          │                                       │
│                   OpenAI-Compatible API                          │
│                          │                                       │
├──────────────────────────┼───────────────────────────────────────┤
│                    Inference Layer                               │
│  +-------------+  +-------------+  +---------------------+      │
│  |   Ollama    |  | llama-server|  |   Native build       |     │
│  | (ROCm)      |  | (ROCm/HIP)  |  | (cmake -DGGML_HIP=ON)|     │
│  +------+------+  +------+------+  +----------+-----------+     │
│         |                |                    |                  │
├─────────┴────────────────┴────────────────────┴──────────────────┤
│                     Storage Layer                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │            tank/ai/models (ZFS Dataset)                    │  │
│  │  recordsize=1M │ compression=off (models already compressed)│  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Engine choices for Strix Halo / gfx1151

On this hardware the practical engines are llama.cpp (HIP build) and Ollama (which uses llama.cpp under the hood). MLX is Apple-Silicon-only, and vLLM does not currently ship a working build for gfx1151 — keep both off the recommended path until that changes. See Inference Engines for details.

Quick Start Paths¶

Path 1: Web UI in front of Ollama (easiest)¶

Install ROCm and verify gfx1151 is detected.
Install Ollama natively (it auto-detects ROCm).
Deploy Open WebUI as a Docker container; point it at the host Ollama.
Pull a 70B Q4 model and chat from any browser.

Path 2: Container stack (compose-managed)¶

Create the ZFS dataset for model files.
Deploy the Ollama container with /dev/kfd + /dev/dri device mounts.
Expose Ollama's OpenAI-compatible endpoint on the LAN (or only via Tailscale).

Path 3: Maximum performance (native HIP build)¶

Install ROCm 7.x.
Build llama.cpp with HIP for gfx1151.
Run llama-server directly under systemd; tune --parallel, -ngl 99, KV-cache quantization.

What You Can Run on 128GB¶

Assuming GTT is sized to ~108 GB (see Memory Configuration), with ~20 GB reserved for the OS / ARC / VMs:

Model Size	Quantization	Memory	Example Models
7-8B	Q8_0	~10 GB	Llama 3.x 8B, Mistral 7B, Qwen3 8B
32-34B	Q4_K_M	~20 GB	Qwen3 32B, DeepSeek Coder, Codestral
70B	Q4_K_M	~40 GB	Llama 3.3 70B, Qwen3 72B
70B	Q6_K	~55 GB	Higher-quality 70B with long context
70B	Q8_0	~75 GB	Near-fp16 quality 70B
120-200B MoE	Q4/Q8	60-110 GB	DeepSeek-V3-class, GPT-OSS-120B-MXFP4
405B	IQ2/IQ1	~100 GB	Llama 3.1 405B, very tight context

Real-world tok/s on this box (ROCm/HIP, single-stream):

Model	Tokens/sec (gen)
8B Q4	~50-70
32B Q4	~15-20
70B Q4	~6-9
70B Q6	~4-6
405B IQ2	~1-2

Model recommendations rot fast — verify on ollama.com or huggingface.co/models on the day you set this up.

Section Overview¶

Fundamentals

Why local LLMs, unified memory advantages, architecture decisions

Learn basics
Inference Engines

llama.cpp (HIP) and Ollama on Strix Halo; MLX/vLLM noted but not for this box

Compare engines
GUI Tools

LM Studio, Jan.ai, Open WebUI - visual interfaces

GUI options
Container Deployment

Docker setups for Ollama and llama.cpp with ZFS storage

Containers
API Serving

OpenAI-compatible endpoints, LocalAI, load balancing

API setup
VM Integration

Call the host's Ollama API from VMs (the host owns the iGPU)

VM setup
AI Coding Tools

Claude Code, Aider, Cline, Continue.dev configuration

Coding tools
Model Management

Choosing models, quantization, Hugging Face downloads

Models
Performance

Benchmarking, context optimization, memory management

Performance
Remote Access

Tailscale integration, API security, remote inference

Remote access

Docker Setup - Container runtime configuration
ZFS Datasets - Storage configuration
GPU Passthrough - VM GPU access
Tailscale Serve - Remote access