Performance¶
Optimize and measure LLM inference performance on your system.
Overview¶
Key performance metrics:
- Tokens/second - Generation speed
- Time to First Token (TTFT) - Initial response latency
- Context processing - Prompt evaluation speed
- Memory usage - VRAM and RAM utilization
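A quick way to put numbers on the first three metrics is llama.cpp's bundled `llama-bench` tool. A minimal sketch, assuming a llama.cpp build and a placeholder model path:

```bash
# -p 512   process a 512-token prompt (measures context processing)
# -n 128   generate 128 tokens (measures tokens/sec)
# -ngl 99  offload all layers to the GPU
# -r 5     repeat each test 5 times and average
llama-bench -m ./models/model.gguf -p 512 -n 128 -ngl 99 -r 5
```

The `pp` rows in the output reflect context processing speed and the `tg` rows generation speed; TTFT is roughly the prompt length divided by the `pp` rate.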
Performance Factors¶
┌─────────────────────────────────────────────────────────────────┐
│ Performance Equation │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Tokens/sec = f(Memory Bandwidth, GPU Compute, Model Size) │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Memory Bandwidth│ │ GPU/Compute │ │ Model Size │ │
│ │ (Primary) │ │ (Secondary) │ │ (Constraint) │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ Higher = faster │ │ More = faster │ │ Smaller = faster│ │
│ │ token generation│ │ prompt eval │ │ for given HW │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
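Memory bandwidth is primary because each generated token streams the full set of weights through memory once, which sets a hard ceiling on generation speed. A back-of-the-envelope check, using assumed figures of ~1008 GB/s bandwidth (RTX 4090) and ~7 GB for a 7B Q8_0 model:

```bash
# tokens/sec ceiling ~= memory bandwidth (GB/s) / model size (GB)
awk 'BEGIN { printf "~%.0f tokens/sec ceiling\n", 1008 / 7 }'
# => ~144 tokens/sec; measured rates (see tables below) sit below this
```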
Quick Performance Reference¶
Expected Performance (NVIDIA RTX 4090)¶
| Model | Quant | Tokens/sec | TTFT |
|---|---|---|---|
| 7B | Q8_0 | 80-100 | <50ms |
| 13B | Q4_K_M | 60-80 | <100ms |
| 34B | Q4_K_M | 35-50 | <200ms |
| 70B | Q4_K_M | 20-35 | <300ms |
Expected Performance (AMD GPU, 128 GB RAM)¶
| Model | Quant | Tokens/sec | Notes |
|---|---|---|---|
| 7B | Q8_0 | 50-70 | Vulkan or ROCm |
| 34B | Q4_K_M | 25-40 | Good balance |
| 70B | Q4_K_M | 15-25 | Memory bandwidth limited |
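These numbers assume the GPU backend is actually in use. A quick sanity check that the device is visible to ROCm or Vulkan (standard tooling; output details vary by driver version):

```bash
# ROCm: GPU agents appear as gfx* entries
rocminfo | grep -i gfx
# Vulkan: the GPU should be listed as a device
vulkaninfo --summary | grep -i deviceName
```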
Topics¶
- Benchmarking - Measure and compare inference performance
- Context Optimization - Balance context length and memory usage
- Memory Management - GPU layers, offloading, and multi-model strategies
Quick Wins¶
Immediate Optimizations¶
| Optimization | Impact | Complexity |
|---|---|---|
| Use GPU (all layers) | 5-20x | Low |
| Optimal quantization | 20-50% speed | Low |
| Reduce context | 10-30% speed | Low |
| Flash attention | 20-40% for long context | Medium |
| Batch requests | 2-4x throughput | Medium |
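Most of these quick wins map directly to llama.cpp server flags. A sketch combining them, with a placeholder model path and sizes to tune for your hardware:

```bash
# -ngl 99       offload all layers to the GPU (the biggest single win)
# -c 4096       keep the context modest to reduce memory use
# --flash-attn  enable flash attention (helps most at long context)
# --parallel 4  serve up to 4 requests concurrently (batched throughput)
llama-server -m ./models/model-q4_k_m.gguf \
  -ngl 99 -c 4096 --flash-attn --parallel 4
```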
Hardware Upgrades¶
| Upgrade | Impact | Cost |
|---|---|---|
| More VRAM | Run larger models | High |
| Faster memory | Better bandwidth | High |
| Better GPU | More compute | High |
| NVMe SSD | Faster loading | Medium |
Monitoring¶
Real-Time Metrics¶
```bash
# GPU utilization (NVIDIA), refreshed every second
nvidia-smi -l 1

# GPU utilization (AMD); rocm-smi takes one sample, so wrap it in watch
watch -n 1 rocm-smi

# Memory pressure
watch -n 1 free -h

# llama.cpp server metrics (start the server with --metrics to enable)
curl http://localhost:8080/metrics
```
Key Indicators¶
| Metric | Good | Warning |
|---|---|---|
| GPU utilization | 80-100% | <50% |
| Memory usage | <90% | >95% |
| Tokens/sec | Model dependent | Degrading over time |
| TTFT | <500ms | >2s |
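On NVIDIA hardware, the first two indicators can be watched with a single query loop (standard `nvidia-smi` query fields):

```bash
# GPU utilization plus memory used/total, refreshed every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
  --format=csv -l 1
```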
Bottleneck Identification¶
Slow Generation?
│
▼
┌──────────────────┐
│ Check GPU usage │
└────────┬─────────┘
│
┌────┴────┐
│ │
▼ ▼
Low? High?
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ GPU not │ │ Memory │
│ used │ │ bound │
└────┬────┘ └────┬────┘
│ │
▼ ▼
Add more Use smaller
GPU layers quant/model
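If utilization is low, first confirm how many layers were actually offloaded. llama.cpp reports an "offloaded N/M layers" line at load time (exact wording varies by version), so a one-token run is enough to check:

```bash
# Generate a single token and grep the load log for the offload report
llama-cli -m ./models/model.gguf -ngl 99 -n 1 -p "hi" 2>&1 | grep -i offload
```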
See Also¶
- Quantization - Size/speed tradeoffs
- Choosing Models - Model selection
- GPU Containers - GPU configuration