# AI & Local LLMs
Run large language models locally on a 128GB unified memory system for privacy, cost savings, and low latency.
## Why Local LLMs?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No API fees after hardware investment |
| Latency | Sub-100ms first token for local inference |
| Offline | Works without internet connection |
| Control | Choose models, tune parameters, no rate limits |
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Claude Code │  │    Aider    │  │  Cline / Continue.dev   │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
│                OpenAI-Compatible API                            │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                         Inference Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   Ollama    │  │  llama.cpp  │  │     MLX     │              │
│  │  (Docker)   │  │  (Docker)   │  │  (Native)   │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
├─────────┴────────────────┴────────────────┴──────────────────────┤
│                          Storage Layer                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │             tank/ai/models  (ZFS Dataset)                 │  │
│  │  recordsize=1M │ compression=zstd │ ~500GB capacity       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
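Everything in the client layer speaks the same OpenAI-compatible protocol, so moving between Ollama, llama.cpp, and MLX-backed servers is mostly a base-URL change. A minimal sketch, assuming Ollama is serving on its default port 11434 and that a model tagged `llama3.3:70b` has already been pulled:

```python
# Minimal sketch: chat with a local model over the OpenAI-compatible API.
# Assumes Ollama on its default port (11434) and a pulled "llama3.3:70b" tag;
# for LM Studio, swap the base_url for http://localhost:1234/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="local",  # local servers ignore the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain what recordsize=1M does on a ZFS dataset."}],
)
print(response.choices[0].message.content)
```

Pointing the same call at LM Studio or a llama.cpp server only requires changing `base_url` and the model name.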
## Quick Start Paths
### Path 1: Fastest Setup (GUI)
- Install LM Studio - download and run
- Download a model (Llama 3.3 70B Q4)
- Start the local server and connect coding tools (see the check below)
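Before wiring up coding tools, confirm the LM Studio server is actually reachable. A quick check, assuming LM Studio's default server port of 1234:

```python
# List the models LM Studio is currently serving on its default port (1234).
# The ids printed here are what you put in a coding tool's model field.
import requests

resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```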
### Path 2: Server/Container Setup
- Create ZFS dataset for model storage
- Deploy Ollama container
- Pull models and expose the OpenAI-compatible API (see the sketch below)
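Once the container is up, models can be pulled through Ollama's REST API instead of exec-ing into the container. A sketch, assuming the container publishes the default port 11434 and using `qwen2.5-coder:32b` as an example tag:

```python
# Pull a model through the Ollama REST API and stream download progress.
# The tag below is only an example; any tag from the Ollama library works.
import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "qwen2.5-coder:32b"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status", ""))
```

After the pull completes, the OpenAI-compatible sketch from the Architecture Overview works against this container unchanged.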
### Path 3: Maximum Performance
- Install MLX (mlx-lm) for Apple Silicon
- Download MLX-converted models from Hugging Face (e.g. the mlx-community organization)
- Run inference with Metal acceleration (see the sketch below)
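A minimal mlx-lm sketch (`pip install mlx-lm`); the repo id below is an example from the mlx-community namespace and can be swapped for any MLX conversion:

```python
# Load a 4-bit MLX conversion from Hugging Face and generate a completion.
# Metal acceleration is used automatically on Apple Silicon; no flags needed.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a one-line summary of what a ZFS snapshot is.",
    max_tokens=128,
)
print(text)
```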
## What You Can Run on 128GB
| Model Size | Quantization | VRAM Usage | Example Models |
|---|---|---|---|
| 7-8B | Q8_0 | ~10GB | Llama 3.1 8B, Mistral 7B |
| 32-34B | Q5_K_M | ~25GB | Qwen 2.5 32B, DeepSeek Coder 33B |
| 70B | Q4_K_M | ~43GB | Llama 3.3 70B, Qwen 2.5 72B |
| 70B | Q6_K | ~58GB | Higher quality 70B |
| 405B | Q2_K | ~95GB | Llama 3.1 405B (limited context) |
**The 75% VRAM rule:** Reserve 25% of unified memory for system overhead. On 128GB, target ~96GB for models.
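The table's figures follow from a simple estimate: the weights take roughly parameters × bits-per-weight ÷ 8 bytes, with the KV cache and runtime on top. A sketch of that arithmetic (the bits-per-weight values are rough approximations of the llama.cpp quant formats):

```python
# Rough weight-size estimate behind the table above, checked against the
# 75% budget; the KV cache and runtime overhead come out of what is left.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # e.g. 70B at ~4.8 bits/weight (Q4_K_M) -> ~42 GB of weights
    return params_billion * bits_per_weight / 8

budget_gb = 128 * 0.75  # ~96 GB usable under the 75% rule
for name, params, bits in [
    ("8B @ Q8_0", 8, 8.5),
    ("32B @ Q5_K_M", 32, 5.7),
    ("70B @ Q4_K_M", 70, 4.8),
    ("70B @ Q6_K", 70, 6.6),
]:
    w = weights_gb(params, bits)
    print(f"{name}: ~{w:.0f} GB weights, ~{budget_gb - w:.0f} GB left for KV cache and runtime")
```

The KV cache grows with context length, which is why the 405B row only fits with limited context.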
## Section Overview

- **Fundamentals**: Why local LLMs, unified memory advantages, architecture decisions
- **Inference Engines**: llama.cpp, Ollama, MLX, vLLM, and when to use each
- **GUI Tools**: LM Studio, Jan.ai, and Open WebUI visual interfaces
- **Container Deployment**: Docker setups for Ollama and llama.cpp with ZFS storage
- **API Serving**: OpenAI-compatible endpoints, LocalAI, load balancing
- **VM Integration**: LM Studio in a Windows VM with GPU passthrough
- **AI Coding Tools**: Claude Code, Aider, Cline, and Continue.dev configuration
- **Model Management**: Choosing models, quantization, Hugging Face downloads
- **Performance**: Benchmarking, context optimization, memory management
- **Remote Access**: Tailscale integration, API security, remote inference
## Related Documentation
- Docker Setup - Container runtime configuration
- ZFS Datasets - Storage configuration
- GPU Passthrough - VM GPU access
- Tailscale Serve - Remote access