AI & Local LLMs¶
Run large language models locally on a 128GB unified memory system for privacy, cost savings, and low latency.
Why Local LLMs?¶
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No API fees after hardware investment |
| Latency | Sub-100ms first token for local inference |
| Offline | Works without internet connection |
| Control | Choose models, tune parameters, no rate limits |
Architecture Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Claude Code │ │ Aider │ │ Cline / Continue.dev │ │
│ └──────┬──────┘ └──────┬──────┘ └────────────┬────────────┘ │
│ │ │ │ │
│ └────────────────┴──────────────────────┘ │
│ │ │
│ OpenAI-Compatible API │
│ │ │
├──────────────────────────┼───────────────────────────────────────┤
│ Inference Layer │
│ +-------------+ +-------------+ +---------------------+ │
│ | Ollama | | llama-server| | Native build | │
│ | (ROCm) | | (ROCm/HIP) | | (cmake -DGGML_HIP=ON)| │
│ +------+------+ +------+------+ +----------+-----------+ │
│ | | | │
├─────────┴────────────────┴────────────────────┴──────────────────┤
│ Storage Layer │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ tank/ai/models (ZFS Dataset) │ │
│ │ recordsize=1M │ compression=off (models already compressed)│ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Engine choices for Strix Halo / gfx1151
On this hardware the practical engines are llama.cpp (HIP build) and Ollama (which uses llama.cpp under the hood). MLX is Apple-Silicon-only, and vLLM does not currently ship a working build for gfx1151 — keep both off the recommended path until that changes. See Inference Engines for details.
Quick Start Paths¶
Path 1: Web UI in front of Ollama (easiest)¶
- Install ROCm and verify
gfx1151is detected. - Install Ollama natively (it auto-detects ROCm).
- Deploy Open WebUI as a Docker container; point it at the host Ollama.
- Pull a 70B Q4 model and chat from any browser.
Path 2: Container stack (compose-managed)¶
- Create the ZFS dataset for model files.
- Deploy the Ollama container with
/dev/kfd+/dev/dridevice mounts. - Expose Ollama's OpenAI-compatible endpoint on the LAN (or only via Tailscale).
Path 3: Maximum performance (native HIP build)¶
- Install ROCm 7.x.
- Build llama.cpp with HIP for
gfx1151. - Run
llama-serverdirectly under systemd; tune--parallel,-ngl 99, KV-cache quantization.
What You Can Run on 128GB¶
Assuming GTT is sized to ~108 GB (see Memory Configuration), with ~20 GB reserved for the OS / ARC / VMs:
| Model Size | Quantization | Memory | Example Models |
|---|---|---|---|
| 7-8B | Q8_0 | ~10 GB | Llama 3.x 8B, Mistral 7B, Qwen3 8B |
| 32-34B | Q4_K_M | ~20 GB | Qwen3 32B, DeepSeek Coder, Codestral |
| 70B | Q4_K_M | ~40 GB | Llama 3.3 70B, Qwen3 72B |
| 70B | Q6_K | ~55 GB | Higher-quality 70B with long context |
| 70B | Q8_0 | ~75 GB | Near-fp16 quality 70B |
| 120-200B MoE | Q4/Q8 | 60-110 GB | DeepSeek-V3-class, GPT-OSS-120B-MXFP4 |
| 405B | IQ2/IQ1 | ~100 GB | Llama 3.1 405B, very tight context |
Real-world tok/s on this box (ROCm/HIP, single-stream):
| Model | Tokens/sec (gen) |
|---|---|
| 8B Q4 | ~50-70 |
| 32B Q4 | ~15-20 |
| 70B Q4 | ~6-9 |
| 70B Q6 | ~4-6 |
| 405B IQ2 | ~1-2 |
Model recommendations rot fast — verify on ollama.com or huggingface.co/models on the day you set this up.
Section Overview¶
-
Fundamentals
Why local LLMs, unified memory advantages, architecture decisions
-
Inference Engines
llama.cpp (HIP) and Ollama on Strix Halo; MLX/vLLM noted but not for this box
-
GUI Tools
LM Studio, Jan.ai, Open WebUI - visual interfaces
-
Container Deployment
Docker setups for Ollama and llama.cpp with ZFS storage
-
API Serving
OpenAI-compatible endpoints, LocalAI, load balancing
-
VM Integration
Call the host's Ollama API from VMs (the host owns the iGPU)
-
AI Coding Tools
Claude Code, Aider, Cline, Continue.dev configuration
-
Model Management
Choosing models, quantization, Hugging Face downloads
-
Performance
Benchmarking, context optimization, memory management
-
Remote Access
Tailscale integration, API security, remote inference
Related Documentation¶
- Docker Setup - Container runtime configuration
- ZFS Datasets - Storage configuration
- GPU Passthrough - VM GPU access
- Tailscale Serve - Remote access