
Performance¶

Optimize and measure LLM inference performance on your system.

Overview¶

Key performance metrics:

Tokens/second - Generation (decode) speed
Time to First Token (TTFT) - Latency between sending a request and receiving the first token
Context processing - Prompt evaluation (prefill) speed, in tokens/second
Memory usage - VRAM and RAM utilization
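
As a rough illustration of how the first three metrics relate, the sketch below derives them from timestamps recorded around a single streamed generation. The `GenerationTiming` class and `summarize` function are hypothetical names introduced here, not part of any library, and the prompt rate is only an approximation because TTFT also includes queueing and transport overhead.

```python
from dataclasses import dataclass


@dataclass
class GenerationTiming:
    """Timestamps (in seconds) recorded around one streamed generation."""
    request_sent: float     # when the request was issued
    first_token: float      # when the first generated token arrived
    last_token: float       # when the final generated token arrived
    prompt_tokens: int      # number of tokens in the prompt
    generated_tokens: int   # number of tokens the model produced


def summarize(t: GenerationTiming) -> dict:
    """Derive TTFT, generation speed, and approximate prompt speed."""
    ttft = t.first_token - t.request_sent
    decode_time = t.last_token - t.first_token
    # The first token's latency is already counted as TTFT, so the decode
    # rate covers only the tokens that arrived after it.
    generation_rate = (
        (t.generated_tokens - 1) / decode_time if decode_time > 0 else float("nan")
    )
    # Approximation: treats all of TTFT as prompt evaluation time.
    prompt_rate = t.prompt_tokens / ttft if ttft > 0 else float("nan")
    return {
        "ttft_s": ttft,
        "generation_tok_s": generation_rate,
        "prompt_tok_s": prompt_rate,
    }


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    print(summarize(GenerationTiming(0.0, 0.5, 10.5, 1000, 201)))
```

With these made-up timestamps the result is a TTFT of 0.5 s, 20 tokens/second of generation, and roughly 2000 tokens/second of prompt evaluation.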

Performance Factors¶

┌─────────────────────────────────────────────────────────────────┐
│                    Performance Equation                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Tokens/sec = f(Memory Bandwidth, GPU Compute, Model Size)      │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Memory Bandwidth│  │  GPU/Compute    │  │   Model Size    │  │
│  │  (Primary)      │  │  (Secondary)    │  │  (Constraint)   │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ Higher = faster │  │ More = faster   │  │ Smaller = faster│  │
│  │ token generation│  │ prompt eval     │  │ for given HW    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
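
One way to make the memory-bandwidth term concrete: for a dense model generating one sequence at a time, each new token requires reading roughly all of the weights from memory once, so bandwidth divided by model size gives an approximate ceiling on tokens/second. The sketch below assumes exactly that simplification. The function name and the example numbers are illustrative assumptions, not measurements.

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Approximate upper bound on single-stream generation speed.

    Assumes a dense model whose weights are all read once per generated
    token, and ignores KV-cache reads, compute time, and other overhead.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Illustrative (assumed) values: 256 GB/s of memory bandwidth and a
    # quantized model that occupies about 40 GB.
    print(max_decode_tokens_per_sec(256.0, 40.0))  # 6.4
```

Real throughput usually lands below this ceiling, but the estimate is a quick check on whether a measured number is plausible for a given model and machine.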

Quick Performance Reference¶

Expected Performance (MS-S1 MAX — AMD Strix Halo iGPU, 128GB unified)¶

Model	Quant	Tokens/sec	Notes
7B	Q8_0	50-70	ROCm/HIP, fits easily
8B	Q4_K_M	50-70	Sweet spot for chat
32B	Q4_K_M	15-20	Good balance
70B	Q4_K_M	6-9	Memory bandwidth limited
405B	Q2_K	1-2	Fits in 128GB, very slow

Expected Performance (Apple Silicon — M-series unified memory)¶

Model	Quant	Tokens/sec	Notes
7B	Q8_0	60-90	Metal/MLX
32B	Q4_K_M	20-30	Memory bandwidth dependent
70B	Q4_K_M	8-12	Requires high-memory M-Max/M-Ultra

Topics¶

Benchmarking - Measure and compare inference performance
Context Optimization - Balance context length and memory usage
Memory Management - GPU layers, offloading, and multi-model strategies

Quick Wins¶

Immediate Optimizations¶

Optimization	Impact	Complexity
Use GPU (all layers)	5-20x faster than CPU-only	Low
Optimal quantization	20-50% faster	Low
Reduce context	10-30% faster	Low
Flash attention	20-40% faster for long context	Medium
Batch requests	2-4x total throughput	Medium
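
To see why reducing context helps, it can be useful to estimate how much memory the KV cache takes. The sketch below uses the standard sizing for a transformer with grouped-query attention: keys plus values, per layer, per KV head, per position. The function name is hypothetical, and the example shape (32 layers, 8 KV heads, head dimension 128) is an assumed configuration typical of an 8B-class model, not a value taken from this page.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_element: int = 2,  # 2 bytes = 16-bit cache entries
) -> int:
    """Estimate KV-cache size in bytes for one sequence."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed model shape, for illustration only.
    full = kv_cache_bytes(32, 8, 128, context_len=8192)
    half = kv_cache_bytes(32, 8, 128, context_len=4096)
    print(full / GIB)  # 1.0
    print(half / GIB)  # 0.5
```

The cache grows linearly with context length, so halving the context halves this part of the memory footprint. A quantized KV cache would correspond to a smaller `bytes_per_element`.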

Hardware Upgrades¶

Upgrade	Impact	Cost
More VRAM	Run larger models	High
Faster memory	Better bandwidth	High
Better GPU	More compute	High
NVMe SSD	Faster loading	Medium

Monitoring¶

Real-Time Metrics¶

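# GPU utilization (AMD ROCm, MS-S1 MAX)
rocm-smi                # one-off snapshot
watch -n 1 rocm-smi     # refresh every second

# GPU utilization (Apple Silicon, laptop); samples every 1000 ms
sudo powermetrics --samplers gpu_power -i 1000

# Memory pressure (Linux; free is not available on macOS)
watch -n 1 free -h

# llama.cpp server metrics (Prometheus format; the server must be started with --metrics)
curl http://localhost:8080/metrics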
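
The metrics endpoint returns Prometheus-style plain text, which is easy to post-process. The following is a minimal sketch of a parser for the simple `name value` form, assuming no trailing timestamps on the sample lines. The metric names in `SAMPLE` are made up for illustration and are not the names llama.cpp actually exports; check the real output of the endpoint before relying on any of them.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    metrics: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines
        # The value is the last whitespace-separated field on the line.
        name, _, value = line.rpartition(" ")
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


# Hypothetical payload; metric names are placeholders.
SAMPLE = """\
# HELP example_tokens_predicted_total Number of generated tokens.
# TYPE example_tokens_predicted_total counter
example_tokens_predicted_total 1200
example_tokens_predicted_seconds_total 60
"""

if __name__ == "__main__":
    m = parse_prometheus_text(SAMPLE)
    rate = m["example_tokens_predicted_total"] / m["example_tokens_predicted_seconds_total"]
    print(rate)  # 20.0
```

In practice the text would come from the endpoint itself, for example via `urllib.request.urlopen` from the standard library, rather than from a hard-coded string.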

Key Indicators¶

Metric	Good	Warning
GPU utilization	80-100%	<50%
Memory usage	<90%	>95%
Tokens/sec	Model dependent	Degrading over time
TTFT	<500ms	>2s
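
The thresholds in this table can be turned into a small health check. The sketch below is one possible encoding. The metric keys and the `borderline` label for values that fall between the Good and Warning columns are assumptions added here, since the table does not name that middle range.

```python
def classify(metric: str, value: float) -> str:
    """Classify a reading as 'good', 'warning', or 'borderline'."""
    if metric == "gpu_utilization_pct":
        if value >= 80:
            return "good"
        if value < 50:
            return "warning"
    elif metric == "memory_usage_pct":
        if value < 90:
            return "good"
        if value > 95:
            return "warning"
    elif metric == "ttft_ms":
        if value < 500:
            return "good"
        if value > 2000:
            return "warning"
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Value sits between the two thresholds listed in the table.
    return "borderline"


if __name__ == "__main__":
    print(classify("gpu_utilization_pct", 92))  # good
    print(classify("memory_usage_pct", 97))     # warning
    print(classify("ttft_ms", 800))             # borderline
```

Tokens/sec is left out because the table gives it no fixed threshold; it is better tracked as a trend over time for a given model.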

Bottleneck Identification¶

Slow Generation?
       │
       v
┌──────────────────┐
│ Check GPU usage  │
└────────┬─────────┘
         │
    ┌────┴────┐
    │         │
    v         v
  Low?      High?
    │         │
    v         v
┌─────────┐ ┌─────────┐
│ GPU not │ │ Memory  │
│ used    │ │ bound   │
└────┬────┘ └────┬────┘
     │           │
     v           v
Add more    Use smaller
GPU layers  quant/model
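
The decision tree above can also be written as a function. This is only a sketch: the flowchart does not say where "low" ends and "high" begins, so the 50% cutoff is an assumption borrowed from the Warning column of the Key Indicators table, and the function name is hypothetical.

```python
def diagnose_slow_generation(
    gpu_utilization_pct: float,
    low_threshold_pct: float = 50.0,  # assumed cutoff, not defined by the flowchart
) -> str:
    """Follow the flowchart: low GPU usage vs. high GPU usage."""
    if not 0 <= gpu_utilization_pct <= 100:
        raise ValueError("GPU utilization must be between 0 and 100")
    if gpu_utilization_pct < low_threshold_pct:
        # Left branch: the GPU is mostly idle during generation.
        return "GPU not used: add more GPU layers"
    # Right branch: the GPU is busy, so generation is likely memory bound.
    return "Memory bound: use a smaller quant or model"


if __name__ == "__main__":
    print(diagnose_slow_generation(20))  # GPU not used: add more GPU layers
    print(diagnose_slow_generation(95))  # Memory bound: use a smaller quant or model
```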

See Also¶

Quantization - Size/speed tradeoffs
Choosing Models - Model selection
GPU Containers - GPU configuration