Unified Memory for LLMs

Apple Silicon's unified memory architecture provides unique advantages for running large language models.

Understanding Unified Memory

Unlike discrete GPUs with separate VRAM, Apple Silicon shares memory between CPU and GPU:

Traditional Architecture:
┌─────────────┐     PCIe      ┌─────────────┐
│    CPU      │◄────────────►│    GPU      │
│  (64GB RAM) │               │ (24GB VRAM) │
└─────────────┘               └─────────────┘
     │                              │
     ▼                              ▼
  System RAM                   GPU VRAM
  (Unused by GPU)             (Model lives here)

Apple Silicon Unified:
┌─────────────────────────────────────────┐
│            M-Series SoC                 │
│  ┌─────────┐         ┌─────────┐        │
│  │   CPU   │◄───────►│   GPU   │        │
│  └────┬────┘         └────┬────┘        │
│       │                   │             │
│       ▼                   ▼             │
│  ┌──────────────────────────────────┐   │
│  │      Unified Memory (128GB)      │   │
│  │      Shared by CPU + GPU         │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
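
To see how much unified memory a particular machine exposes (and therefore what the GPU can address), the built-in macOS tools are enough; a quick check from the terminal:

# Total unified memory in GB (hw.memsize reports bytes)
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"

# Chip and memory summary
system_profiler SPHardwareDataType | grep -E "Chip|Memory"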

Memory Advantages

High Capacity

| Platform | Typical Max | Notes |
|---|---|---|
| Consumer GPUs | 24GB | RTX 4090; limits models to ~13B at Q8 |
| Pro GPUs | 48-80GB | A6000, H100; expensive |
| Mac Studio (M2 Ultra) | 192GB | Unified, accessible |
| Mac Studio (M4 Max) | 128GB | Unified, accessible |

Bandwidth

M4-series memory bandwidth:

| Chip | Bandwidth | Notes |
|---|---|---|
| M4 | 120 GB/s | Base chip |
| M4 Pro | 273 GB/s | Good for inference |
| M4 Max | 546 GB/s | Excellent for large models |

Token generation is memory-bandwidth bound: each generated token requires streaming essentially all of the model's weights from memory, so higher bandwidth translates directly into more tokens per second.
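
As a rough sanity check, an upper bound on decode speed is memory bandwidth divided by the bytes read per token (approximately the size of the quantized weights); real-world throughput lands below this ceiling. A quick sketch using assumed example numbers (M4 Max, 70B at Q4):

# Theoretical ceiling: bandwidth (GB/s) / weights read per token (GB)
# 546 GB/s (M4 Max) with a 70B Q4 model (~35GB of weights)
echo "546 35" | awk '{printf "~%.0f tokens/sec upper bound\n", $1 / $2}'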

The 75% Rule

Reserve 25% of unified memory for system overhead:

| Total Memory | Available for Models | Comfortable Model Size |
|---|---|---|
| 32GB | 24GB | 7-13B at Q4-Q8 |
| 64GB | 48GB | 34B at Q4-Q6, 70B at Q2 |
| 96GB | 72GB | 70B at Q4-Q5 |
| 128GB | 96GB | 70B at Q6, 405B at Q2 |
| 192GB | 144GB | 70B at Q8, 405B at Q3-Q4 |

Memory Calculation

Estimate VRAM requirements:

VRAM (GB) ≈ Parameters (B) × Bits / 8

Examples:
- 70B Q4 (4-bit): 70 × 4 / 8 = 35GB base
- 70B Q4 with context: ~43GB typical
- 405B Q2 (2-bit): 405 × 2 / 8 = 101GB base

Add 20-30% overhead for:

- KV cache (scales with context length)
- Activation memory
- System buffers
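
Combining the formula with the 75% rule, a back-of-the-envelope fit check looks like this (the 1.25 overhead factor and the 70B/Q4 example values are assumptions, not measurements):

# Does a 70B model at 4-bit fit within this machine's 75% budget?
params_b=70        # parameters, in billions
bits=4             # quantization bit width
overhead=1.25      # assumed ~25% for KV cache, activations, buffers

total_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
echo "$params_b $bits $overhead $total_gb" | awk '{
  need = $1 * $2 / 8 * $3;   # estimated footprint in GB
  budget = $4 * 0.75;        # 75% rule
  printf "need ~%.0f GB, budget ~%.0f GB -> %s\n", need, budget, (need <= budget ? "fits" : "too large");
}'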

Monitoring Memory Usage

Built-in macOS Tools

Activity Monitor's Memory tab shows memory pressure graphically; the same information is available from the terminal:

# Memory pressure indicator
memory_pressure

# Detailed memory stats
vm_stat

During Inference

Monitor in real-time:

# Watch memory usage (watch is not part of base macOS; install with: brew install watch)
watch -n 1 'memory_pressure | head -5'

# For Ollama specifically
ollama ps  # Shows loaded models and memory

Signs of Memory Pressure

| Symptom | Cause | Solution |
|---|---|---|
| Slow generation | Swap usage | Reduce model size or context |
| System unresponsive | Memory exhausted | Offload fewer layers to the GPU |
| Model fails to load | Insufficient memory | Use a more aggressive quantization |

Optimizing Memory Usage

Context Length Tradeoffs

KV cache grows linearly with context length (the exact size also depends on the model's layer count, attention configuration, and KV precision):

| Context | Additional Memory | Use Case |
|---|---|---|
| 4K | ~1GB | Short conversations |
| 8K | ~2GB | Standard coding |
| 32K | ~8GB | Large file context |
| 128K | ~32GB | Repository-wide context |
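
Both Ollama and llama.cpp let you cap the context window per run, which is often the quickest way to claw back memory; a sketch (model names and paths are illustrative):

# Ollama: cap the context window from inside an interactive session
ollama run llama3.1:70b
>>> /set parameter num_ctx 8192

# llama.cpp: set the context size at launch (-c / --ctx-size)
./llama-cli -m ./models/llama-70b-q4_k_m.gguf -c 8192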

Multi-Model Strategies

Running multiple models simultaneously:

# Example: a code model and a chat model resident together
# DeepSeek Coder 33B Q4: ~20GB
# Llama 3.1 8B Q8: ~10GB
# Total: ~30GB, leaving headroom for a 70B main model on higher-memory machines
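
Whether both models stay resident at once depends on Ollama's concurrency settings; a sketch assuming the OLLAMA_MAX_LOADED_MODELS environment variable and illustrative model tags:

# Allow two models to stay loaded simultaneously (set before starting the server)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# In another terminal: warm up both models, then confirm what is resident
ollama run deepseek-coder:33b "hello"
ollama run llama3.1:8b "hello"
ollama ps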

Offloading Options

When models exceed available memory:

| Strategy | Tradeoff |
|---|---|
| More aggressive quantization | Slight quality loss |
| Fewer GPU layers | Slower inference (CPU fallback) |
| Smaller context | Less conversation history |
| Unload unused models | Manual model switching |
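
For the GPU-layer strategy, llama.cpp exposes the split directly through --n-gpu-layers; a sketch with an illustrative model path:

# Offload only 40 layers to the GPU (Metal); the remaining layers run on the CPU
./llama-cli -m ./models/llama-70b-q4_k_m.gguf --n-gpu-layers 40 -c 4096 -p "Hello"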

Apple Silicon Tiers

Model recommendations by chip:

| Chip | Memory | Recommended Models |
|---|---|---|
| M1/M2/M3 | 8-16GB | 7B models only |
| M1/M2/M3 Pro | 16-36GB | Up to 13B |
| M1/M2/M3 Max | 32-96GB | Up to 70B (Q4 on 64GB) |
| M1/M2 Ultra | 64-192GB | 70B comfortably, 405B possible |
| M4 Max | 128GB | 70B at Q6, 405B at Q2 |

Neural Engine

M-series chips include a Neural Engine, but current LLM frameworks primarily use the GPU:

| Component | Used By | Performance |
|---|---|---|
| GPU (Metal) | MLX, llama.cpp | Primary inference path |
| Neural Engine | Core ML (limited) | Not widely supported for LLMs |
| CPU | Fallback | 10-50x slower than GPU |

MLX can achieve up to an 87% performance improvement over llama.cpp on some workloads by using Metal more effectively.
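
A minimal way to try the Metal path through MLX is the mlx-lm tooling; a sketch (the quantized model repository shown is an assumption and may change):

# Install the MLX LLM tooling and run a quantized model on the GPU via Metal
pip install mlx-lm
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Hello"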

See Also