# AI & Local LLMs
Run large language models locally on a 128GB unified memory system for privacy, cost savings, and low latency.
## Why Local LLMs?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No API fees after hardware investment |
| Latency | Sub-100ms first token for local inference |
| Offline | Works without internet connection |
| Control | Choose models, tune parameters, no rate limits |
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Claude Code │  │    Aider    │  │  Cline / Continue.dev   │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
│                OpenAI-Compatible API                            │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                         Inference Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   Ollama    │  │  llama.cpp  │  │     MLX     │              │
│  │  (Docker)   │  │  (Docker)   │  │  (Native)   │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
├─────────┴────────────────┴────────────────┴──────────────────────┤
│                          Storage Layer                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │             tank/ai/models  (ZFS Dataset)                 │  │
│  │  recordsize=1M │ compression=zstd │ ~500GB capacity       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
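Everything in the client layer speaks the same OpenAI-compatible protocol, so moving between Ollama, llama.cpp, and MLX-backed servers is mostly a base-URL change. A minimal sketch, assuming Ollama is serving on its default port 11434 and that a model tagged `llama3.3:70b` has already been pulled:

```python
# Minimal sketch: chat with a local model over the OpenAI-compatible API.
# Assumes Ollama on its default port (11434) and a pulled "llama3.3:70b" tag;
# for LM Studio, swap the base_url for http://localhost:1234/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="local",  # local servers ignore the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain what recordsize=1M does on a ZFS dataset."}],
)
print(response.choices[0].message.content)
```

Pointing the same call at LM Studio or a llama.cpp server only requires changing `base_url` and the model name.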
## Quick Start Paths
### Path 1: Fastest Setup (GUI)
- Install LM Studio - download and run
- Download a model (Llama 3.3 70B Q4)
- Start the local server and connect coding tools (see the check below)
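Before wiring up coding tools, confirm the LM Studio server is actually reachable. A quick check, assuming LM Studio's default server port of 1234:

```python
# List the models LM Studio is currently serving on its default port (1234).
# The ids printed here are what you put in a coding tool's model field.
import requests

resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```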
### Path 2: Server/Container Setup
- Create ZFS dataset for model storage
- Deploy Ollama container
- Pull models and expose the OpenAI-compatible API (see the sketch below)
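Once the container is up, models can be pulled through Ollama's REST API instead of exec-ing into the container. A sketch, assuming the container publishes the default port 11434 and using `qwen2.5-coder:32b` as an example tag:

```python
# Pull a model through the Ollama REST API and stream download progress.
# The tag below is only an example; any tag from the Ollama library works.
import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "qwen2.5-coder:32b"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status", ""))
```

After the pull completes, the OpenAI-compatible sketch from the Architecture Overview works against this container unchanged.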
### Path 3: Maximum Performance
- Install MLX (mlx-lm) for Apple Silicon
- Download MLX-converted models from Hugging Face (e.g. the mlx-community organization)
- Run inference with Metal acceleration (see the sketch below)
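A minimal mlx-lm sketch (`pip install mlx-lm`); the repo id below is an example from the mlx-community namespace and can be swapped for any MLX conversion:

```python
# Load a 4-bit MLX conversion from Hugging Face and generate a completion.
# Metal acceleration is used automatically on Apple Silicon; no flags needed.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a one-line summary of what a ZFS snapshot is.",
    max_tokens=128,
)
print(text)
```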
## What You Can Run on 128GB
| Model Size | Quantization | VRAM Usage | Example Models |
|---|---|---|---|
| 7-8B | Q8_0 | ~10GB | Llama 3.1 8B, Mistral 7B |
| 32-34B | Q5_K_M | ~25GB | Qwen 2.5 32B, DeepSeek Coder 33B |
| 70B | Q4_K_M | ~43GB | Llama 3.3 70B, Qwen 2.5 72B |
| 70B | Q6_K | ~58GB | Higher quality 70B |
| 405B | Q2_K | ~95GB | Llama 3.1 405B (limited context) |
**The 75% VRAM rule:** Reserve 25% of unified memory for system overhead. On 128GB, target ~96GB for models.
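The table's figures follow from a simple estimate: the weights take roughly parameters × bits-per-weight ÷ 8 bytes, with the KV cache and runtime on top. A sketch of that arithmetic (the bits-per-weight values are rough approximations of the llama.cpp quant formats):

```python
# Rough weight-size estimate behind the table above, checked against the
# 75% budget; the KV cache and runtime overhead come out of what is left.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # e.g. 70B at ~4.8 bits/weight (Q4_K_M) -> ~42 GB of weights
    return params_billion * bits_per_weight / 8

budget_gb = 128 * 0.75  # ~96 GB usable under the 75% rule
for name, params, bits in [
    ("8B @ Q8_0", 8, 8.5),
    ("32B @ Q5_K_M", 32, 5.7),
    ("70B @ Q4_K_M", 70, 4.8),
    ("70B @ Q6_K", 70, 6.6),
]:
    w = weights_gb(params, bits)
    print(f"{name}: ~{w:.0f} GB weights, ~{budget_gb - w:.0f} GB left for KV cache and runtime")
```

The KV cache grows with context length, which is why the 405B row only fits with limited context.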
## Section Overview

- **Fundamentals**: Why local LLMs, unified memory advantages, architecture decisions
- **Inference Engines**: llama.cpp, Ollama, MLX, vLLM, and when to use each
- **GUI Tools**: LM Studio, Jan.ai, and Open WebUI visual interfaces
- **Container Deployment**: Docker setups for Ollama and llama.cpp with ZFS storage
- **API Serving**: OpenAI-compatible endpoints, LocalAI, load balancing
- **VM Integration**: LM Studio in a Windows VM with GPU passthrough
- **AI Coding Tools**: Claude Code, Aider, Cline, and Continue.dev configuration
- **Model Management**: Choosing models, quantization, Hugging Face downloads
- **Performance**: Benchmarking, context optimization, memory management
- **Remote Access**: Tailscale integration, API security, remote inference
## Related Documentation
- Docker Setup - Container runtime configuration
- ZFS Datasets - Storage configuration
- GPU Passthrough - VM GPU access
- Tailscale Serve - Remote access