llama.cpp¶
The foundational C/C++ inference engine for running LLMs efficiently on consumer hardware.
Overview¶
llama.cpp provides:
- Cross-platform support - macOS (Metal), Linux (CUDA, Vulkan), Windows
- GGUF format - Optimized quantized model format
- llama-server - OpenAI-compatible API server
- Low dependencies - Minimal runtime requirements
- Active development - Frequent updates and optimizations
Installation¶
macOS (Metal)¶
Build from source for best performance:
# Clone repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with Metal support
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
# Binaries in build/bin/
ls build/bin/
# llama-cli, llama-server, llama-bench, etc.
Linux (CUDA)¶
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
Linux (Vulkan)¶
For AMD GPUs or cross-platform:
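A sketch of the Vulkan build, assuming the Vulkan SDK and drivers are installed; the pattern mirrors the CUDA build above:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with Vulkan support (requires the Vulkan SDK)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)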
Pre-built Binaries¶
Download from GitHub Releases:
# Example for macOS (release asset names vary by build; check the Releases page for the exact filename)
curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/llama-server-macos-arm64.zip
unzip llama-server-macos-arm64.zip
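After unpacking, a quick sanity check that the binary runs (the path depends on how the archive is laid out):
# Print build information and exit
./llama-server --version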
llama-server¶
The OpenAI-compatible server for API access.
Basic Usage¶
# Start server with a model
./llama-server \
-m /path/to/model.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 # context length
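Once the model has loaded, confirm the server is responding; this uses the /health endpoint described under API Endpoints below:
# Returns an "ok" status once the model has finished loading
curl http://localhost:8080/health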
Recommended Configuration¶
./llama-server \
-m /tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 99 \
--threads 8 \
--parallel 2 \
--cont-batching \
--metrics
Here -ngl 99 offloads all layers to the GPU, --threads 8 sets the CPU thread count for any CPU-side work, --parallel 2 allows two concurrent requests, --cont-batching enables continuous batching, and --metrics exposes Prometheus metrics.
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
| -m | Model path | Required |
| -c | Context length | 2048 |
| -ngl | GPU layers (99 for all) | 0 (CPU) |
| --threads | CPU threads | Auto |
| --parallel | Concurrent slots | 1 |
| --host | Listen address | 127.0.0.1 |
| --port | Listen port | 8080 |
| --cont-batching | Continuous batching | Off |
| --flash-attn | Flash attention | Off |
GPU Layer Allocation¶
Control memory usage with -ngl:
# Full GPU (128GB system, 70B Q4)
-ngl 99 # All layers on GPU
# Partial offload (limited memory)
-ngl 40 # 40 layers on GPU, rest on CPU
# CPU only
-ngl 0
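To see how much GPU memory a given -ngl setting actually consumes, watch the GPU while the model loads. On CUDA systems this can be done with nvidia-smi (a driver tool, not part of llama.cpp):
# Report VRAM usage every 2 seconds while the server loads and serves
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2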
API Endpoints¶
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain recursion."}
],
"temperature": 0.7,
"max_tokens": 500
}'
# Text completion
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The capital of France is", "max_tokens": 20}'
# List models
curl http://localhost:8080/v1/models
# Health check
curl http://localhost:8080/health
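Because the responses follow the OpenAI schema, extracting just the reply is straightforward with jq (assuming jq is installed):
# Print only the assistant's message from a chat completion
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.3", "messages": [{"role": "user", "content": "Say hi"}]}' \
| jq -r '.choices[0].message.content'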
Streaming¶
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
Configuration File¶
Create a server config for complex setups:
{
"model": "/tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf",
"host": "0.0.0.0",
"port": 8080,
"ctx_size": 8192,
"n_gpu_layers": 99,
"threads": 8,
"parallel": 2,
"cont_batching": true,
"flash_attn": true
}
Performance Tuning¶
Context Length vs Memory¶
| Context | Memory Impact | Use Case |
|---|---|---|
| 2048 | Baseline | Short prompts |
| 4096 | +~1GB | Standard chat |
| 8192 | +~2GB | Coding tasks |
| 32768 | +~8GB | Long documents |
| 131072 | +~32GB | Full context models |
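The figures above are rough: KV-cache size depends on the model's layer count, KV head count, head dimension, and cache precision. A back-of-the-envelope estimate, assuming Llama-3.3-70B dimensions (80 layers, 8 KV heads, head dim 128) and an f16 cache:
# KV cache bytes = 2 (K and V) x layers x KV heads x head dim x context x bytes per element
n_layer=80; n_kv_heads=8; head_dim=128; n_ctx=8192; bytes_per_elem=2
echo "$(( 2 * n_layer * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024 / 1024 )) MiB"
# ~2560 MiB at 8192 context; the total scales linearly with -c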
Flash Attention¶
Flash attention reduces KV-cache memory use and can speed up inference at long context lengths. It is enabled at runtime with the --flash-attn flag; support depends on the backend the binary was built for.
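For example, combined with a long context (assuming the backend supports it):
./llama-server -m model.gguf -c 32768 -ngl 99 --flash-attn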
Multi-Model Serving¶
Run multiple instances on different ports:
# Terminal 1 - Code model
./llama-server -m deepseek-coder-33b-q4.gguf --port 8081
# Terminal 2 - Chat model
./llama-server -m llama-3.3-70b-q4.gguf --port 8082
Use a reverse proxy to route requests. See Load Balancing.
Benchmarking¶
Use llama-bench to measure performance:
./llama-bench \
-m model.gguf \
-p 512 \
-n 128 \
-ngl 99
# -p sets prompt tokens, -n sets generated tokens, -ngl sets GPU layers
# Output shows tokens/sec for prompt processing and generation
See Benchmarking for methodology.
systemd Service¶
Run llama-server as a service:
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp Server
After=network.target
[Service]
Type=simple
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
-m /tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 99
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
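Then reload systemd and enable the unit:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
# Follow the logs
journalctl -u llama-server -f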
Container Usage¶
See llama.cpp Docker for containerized deployment.
Quick start:
docker run --gpus all -p 8080:8080 \
-v /tank/ai/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/gguf/llama-3.3-70b-q4.gguf \
--host 0.0.0.0 --port 8080 \
-c 4096 -ngl 99
Troubleshooting¶
Model Won't Load¶
# Check available memory
free -h # Linux
memory_pressure # macOS
# Reduce GPU layers
-ngl 30 # Instead of 99
# Use smaller quantization
# Q4_K_M instead of Q6_K
Slow Generation¶
# Verify GPU is being used:
# the startup log should show an "offloaded N/N layers to GPU" line
# Check Metal is enabled (macOS)
./llama-server -m model.gguf --verbose 2>&1 | grep -i metal
# Reduce context if memory-bound
-c 4096 # Instead of 32768
API Connection Refused¶
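Refused connections usually come down to the listen address: by default the server binds 127.0.0.1, so remote clients are rejected unless it was started with --host 0.0.0.0. A quick check, assuming the default port 8080:
# Is the server up, and on which address is it listening?
curl http://localhost:8080/health
ss -tlnp | grep 8080            # Linux
lsof -iTCP:8080 -sTCP:LISTEN    # macOS
# If it only listens on 127.0.0.1, restart with --host 0.0.0.0 and open the port in any firewall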
See Also¶
- Inference Engines Index - Engine comparison
- llama.cpp Docker - Container deployment
- GGUF Formats - Model format details
- Benchmarking - Performance testing