AI & Local LLMs

Run large language models locally on a 128GB unified memory system for privacy, cost savings, and low latency.

Why Local LLMs?

| Benefit | Description |
|---------|-------------|
| Privacy | Data never leaves your machine |
| Cost    | No API fees after the hardware investment |
| Latency | No network round trip; sub-100 ms first-token latency is achievable with smaller models |
| Offline | Works without an internet connection |
| Control | Choose models, tune parameters, no rate limits |

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Client Layer                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Claude Code │  │   Aider     │  │  Cline / Continue.dev   │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
│         │                │                      │                │
│         └────────────────┴──────────────────────┘                │
│                          │                                       │
│                   OpenAI-Compatible API                          │
│                          │                                       │
├──────────────────────────┼───────────────────────────────────────┤
│                    Inference Layer                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   Ollama    │  │ llama.cpp   │  │     MLX     │              │
│  │  (Docker)   │  │  (Docker)   │  │   (Native)  │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                      │
├─────────┴────────────────┴────────────────┴──────────────────────┤
│                     Storage Layer                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              tank/ai/models (ZFS Dataset)                  │  │
│  │     recordsize=1M │ compression=zstd │ ~500GB capacity     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
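
All of the clients in the top layer talk to the inference layer through the same OpenAI-compatible API, so switching engines is mostly a matter of changing the base URL. A minimal sketch in Python, assuming common default ports (Ollama 11434, llama.cpp's llama-server 8080) and a separately chosen port for an MLX server; adjust to your setup:

```python
# One client interface, three local backends. Ports and the model tag below are
# assumptions based on common defaults -- change them to match your deployment.
from openai import OpenAI

BACKENDS = {
    "ollama":   "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    "llamacpp": "http://localhost:8080/v1",   # llama-server default port
    "mlx":      "http://localhost:8081/v1",   # e.g. mlx_lm.server --port 8081
}

def client_for(backend: str) -> OpenAI:
    """Build an OpenAI-compatible client for the chosen local inference engine."""
    # Local servers generally ignore the API key, but the client requires a value.
    return OpenAI(base_url=BACKENDS[backend], api_key="local")

# The same call works against any backend; only the model id differs per engine.
resp = client_for("ollama").chat.completions.create(
    model="llama3.3:70b",  # hypothetical tag; use whatever you have pulled or loaded
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(resp.choices[0].message.content)
```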

Quick Start Paths

Path 1: Fastest Setup (GUI)

  1. Install LM Studio - download and run
  2. Download a model (Llama 3.3 70B Q4)
  3. Start the local server and connect your coding tools (see the sketch below)
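
Before wiring up coding tools, it helps to confirm the LM Studio server is actually reachable. A minimal check in Python, assuming LM Studio's default local endpoint of http://localhost:1234/v1:

```python
# List the models LM Studio's local server exposes. Port 1234 is LM Studio's
# usual default -- treat it as an assumption and match your server settings.
import requests

resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print("available:", model["id"])
```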

Path 2: Server/Container Setup

  1. Create ZFS dataset for model storage
  2. Deploy Ollama container
  3. Pull models and expose the OpenAI-compatible API (see the sketch below)
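
A sketch of those three steps driven from Python; the dataset properties mirror the storage layer above, while the mountpoint, container name, and model tag are placeholders to adapt:

```python
# Path 2 sketch: ZFS dataset -> Ollama container -> model pull.
# Requires ZFS privileges and Docker; paths and names are assumptions.
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Dataset tuned for large model files (matches the storage layer above).
run(["zfs", "create", "-o", "recordsize=1M", "-o", "compression=zstd",
     "tank/ai/models"])

# 2. Ollama container keeping its model store on the dataset
#    (assumed mountpoint /tank/ai/models).
run(["docker", "run", "-d", "--name", "ollama",
     "-p", "11434:11434",
     "-v", "/tank/ai/models/ollama:/root/.ollama",
     "ollama/ollama"])

# 3. Pull a model; the OpenAI-compatible API is then served at
#    http://localhost:11434/v1.
run(["docker", "exec", "ollama", "ollama", "pull", "llama3.3:70b"])
```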

Path 3: Maximum Performance

  1. Install MLX (mlx-lm) for Apple Silicon
  2. Download MLX-converted models from Hugging Face
  3. Run inference with Metal acceleration (see the sketch below)
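
A minimal inference sketch, assuming the mlx-lm package (pip install mlx-lm) and an MLX-converted model; the repo id below is an example from the mlx-community organization:

```python
# Path 3 sketch: load an MLX-format model and generate with Metal acceleration.
# The repo id is an example; any MLX-converted Hugging Face model will do.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in two sentences.",
    max_tokens=128,
    verbose=True,  # stream tokens and print generation stats
)
print(text)
```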

What You Can Run on 128GB

| Model Size | Quantization | VRAM Usage | Example Models |
|------------|--------------|------------|----------------|
| 7-8B       | Q8_0         | ~10GB      | Llama 3.2, Mistral 7B |
| 32-34B     | Q5_K_M       | ~25GB      | Qwen 2.5 32B, DeepSeek Coder 33B |
| 70B        | Q4_K_M       | ~43GB      | Llama 3.3 70B, Qwen 2.5 72B |
| 70B        | Q6_K         | ~58GB      | Higher quality 70B |
| 405B       | Q2_K         | ~95GB      | Llama 3.1 405B (limited context) |

The 75% VRAM rule: Reserve 25% of unified memory for system overhead. On 128GB, target ~96GB for models.
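
A quick way to sanity-check those numbers: weight memory is roughly parameters × bits-per-weight ÷ 8, plus headroom for the KV cache and runtime overhead. A back-of-the-envelope sketch, using approximate effective bits-per-weight for common GGUF quantizations:

```python
# Rough sizing against the ~96GB budget (75% of 128GB unified memory).
# Bits-per-weight values are approximate effective figures for GGUF quantizations;
# actual usage also grows with context length (KV cache) and runtime overhead.
APPROX_BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

MEMORY_BUDGET_GB = 128 * 0.75  # the 75% rule: ~96GB usable for models

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_billion * APPROX_BPW[quant] / 8

for params, quant in [(8, "Q8_0"), (33, "Q5_K_M"), (70, "Q4_K_M"), (70, "Q6_K")]:
    gb = weight_gb(params, quant)
    verdict = "fits" if gb < MEMORY_BUDGET_GB else "too big"
    print(f"{params}B {quant}: ~{gb:.0f} GB weights -> {verdict} within {MEMORY_BUDGET_GB:.0f} GB")
```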

Section Overview

  • Fundamentals: Why local LLMs, unified memory advantages, architecture decisions
  • Inference Engines: llama.cpp, Ollama, MLX, vLLM, and when to use each
  • GUI Tools: LM Studio, Jan.ai, and Open WebUI visual interfaces
  • Container Deployment: Docker setups for Ollama and llama.cpp with ZFS storage
  • API Serving: OpenAI-compatible endpoints, LocalAI, load balancing
  • VM Integration: LM Studio in a Windows VM with GPU passthrough
  • AI Coding Tools: Claude Code, Aider, Cline, and Continue.dev configuration
  • Model Management: Choosing models, quantization, Hugging Face downloads
  • Performance: Benchmarking, context optimization, memory management
  • Remote Access: Tailscale integration, API security, remote inference