
Performance

Optimize and measure LLM inference performance on your system.

Overview

Key performance metrics:

  • Tokens/second - Generation speed
  • Time to First Token (TTFT) - Initial response latency
  • Context processing - Prompt evaluation speed
  • Memory usage - VRAM and RAM utilization
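
If you serve models through the llama.cpp HTTP server, curl's time-to-first-byte is a rough proxy for TTFT when streaming is enabled. A minimal sketch, assuming llama-server is listening on localhost:8080 and using its native /completion endpoint:

# Approximate TTFT as time to first streamed byte
curl -s -o /dev/null \
  -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
  -X POST http://localhost:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 32, "stream": true}'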

Performance Factors

┌─────────────────────────────────────────────────────────────────┐
│                    Performance Equation                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Tokens/sec = f(Memory Bandwidth, GPU Compute, Model Size)      │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Memory Bandwidth│  │  GPU/Compute    │  │   Model Size    │  │
│  │  (Primary)      │  │  (Secondary)    │  │  (Constraint)   │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ Higher = faster │  │ More = faster   │  │ Smaller = faster│  │
│  │ token generation│  │ prompt eval     │  │ for given HW    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
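
In the memory-bound regime that dominates token generation, a useful ceiling is tokens/sec ≈ memory bandwidth / bytes read per token, which is roughly the size of the quantized weights. A back-of-envelope sketch; the bandwidth and model-size values are illustrative assumptions, not measurements:

# Upper bound on generation speed: bandwidth divided by weight bytes per token
BW_GBS=1008     # assumed GPU memory bandwidth in GB/s (e.g. RTX 4090)
MODEL_GB=4.1    # assumed weight size in GB (e.g. 7B at Q4_K_M)
echo "$BW_GBS $MODEL_GB" | awk '{printf "theoretical ceiling: ~%.0f tok/s\n", $1/$2}'

Real throughput lands well below this ceiling because of KV-cache reads, kernel launch overhead, and sampling.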

Quick Performance Reference

Expected Performance (NVIDIA RTX 4090)

Model   Quant    Tokens/sec   TTFT
7B      Q8_0     80-100       <50ms
13B     Q4_K_M   60-80        <100ms
34B     Q4_K_M   35-50        <200ms
70B     Q4_K_M   20-35        <300ms

Expected Performance (AMD GPU/128GB RAM)

Model   Quant    Tokens/sec   Notes
7B      Q8_0     50-70        Vulkan or ROCm
34B     Q4_K_M   25-40        Good balance
70B     Q4_K_M   15-25        Memory bandwidth limited
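
Rather than relying on these reference figures, you can benchmark your own hardware with the llama-bench tool that ships with llama.cpp; the model path below is a placeholder:

# pp = prompt processing speed, tg = token generation speed
llama-bench -m ./models/model.gguf -p 512 -n 128 -ngl 99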

Topics

Quick Wins

Immediate Optimizations

Optimization           Impact                    Complexity
Use GPU (all layers)   5-20x faster              Low
Optimal quantization   20-50% faster             Low
Reduce context         10-30% faster             Low
Flash attention        20-40% for long context   Medium
Batch requests         2-4x throughput           Medium
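
Most of these map directly to llama.cpp flags. A sketch for llama-server; the model path and context size are placeholders, and flag spellings can vary between llama.cpp versions:

# -ngl 99: offload all layers to GPU; -fa: flash attention;
# -c 4096: cap the context; -np 4: serve up to 4 requests in parallel
llama-server -m ./models/model.gguf -ngl 99 -fa -c 4096 -np 4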

Hardware Upgrades

Upgrade        Impact              Cost
More VRAM      Run larger models   High
Faster memory  Higher bandwidth    High
Better GPU     More compute        High
NVMe SSD       Faster loading      Medium

Monitoring

Real-Time Metrics

# GPU utilization (NVIDIA)
nvidia-smi -l 1

# GPU utilization (AMD, refreshed every second)
watch -n 1 rocm-smi

# Memory pressure
watch -n 1 free -h

# llama.cpp server metrics
curl http://localhost:8080/metrics
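
Note that llama-server only serves /metrics when started with metrics enabled. A sketch, assuming the default port:

# Enable the Prometheus-compatible metrics endpoint, then scrape it
llama-server -m ./models/model.gguf --metrics --port 8080
curl -s http://localhost:8080/metrics | grep llamacpp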

Key Indicators

Metric            Good              Warning
GPU utilization   80-100%           <50%
Memory usage      <90%              >95%
Tokens/sec        Model-dependent   Degrading over time
TTFT              <500ms            >2s
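
A minimal watchdog for the GPU-utilization threshold above, using standard nvidia-smi query fields (NVIDIA only):

# Print a warning whenever GPU utilization falls below 50%
while sleep 1; do
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  [ "$util" -lt 50 ] && echo "WARNING: GPU at ${util}%, possible CPU or offload bottleneck"
done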

Bottleneck Identification

Slow Generation?
┌──────────────────┐
│ Check GPU usage  │
└────────┬─────────┘
    ┌────┴────┐
    │         │
    ▼         ▼
  Low?      High?
    │         │
    ▼         ▼
┌─────────┐ ┌─────────┐
│ GPU not │ │ Memory  │
│ used    │ │ bound   │
└────┬────┘ └────┬────┘
     │           │
     ▼           ▼
Add more    Use smaller
GPU layers  quant/model
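
To act on the 'GPU not used' branch, raise the layer-offload count and confirm it in the load log; the model path is a placeholder:

# -ngl 99 offloads every layer that fits; the load log reports
# "offloaded X/Y layers to GPU" so you can confirm the split
llama-cli -m ./models/model.gguf -ngl 99 -p "test prompt" -n 32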
