llama.cpp

The foundational C/C++ inference engine for running LLMs efficiently on consumer hardware.

Overview

llama.cpp provides:

  • Cross-platform support - macOS (Metal), Linux (CUDA, Vulkan), Windows
  • GGUF format - Optimized quantized model format
  • llama-server - OpenAI-compatible API server
  • Low dependencies - Minimal runtime requirements
  • Active development - Frequent updates and optimizations

Installation

macOS (Metal)

Build from source for best performance:

# Clone repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with Metal support
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Binaries in build/bin/
ls build/bin/
# llama-cli, llama-server, llama-bench, etc.
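
A quick smoke test after the build (the model path below is a placeholder for any local GGUF file):

# Confirm the binary runs, then load a model; the startup log should show Metal being initialized
./build/bin/llama-cli --version
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello" -n 16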

Linux (CUDA)

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Linux (Vulkan)

For AMD GPUs or cross-platform GPU support:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
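
Before starting the server, you can confirm the GPU is visible to Vulkan (assumes the vulkan-tools package is installed):

# The GPU should appear in the device summary
vulkaninfo --summary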

Pre-built Binaries

Download from GitHub Releases:

# Example for macOS (Apple Silicon); asset names include the build number,
# so copy the exact URL from the Releases page
curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/llama-server-macos-arm64.zip
unzip llama-server-macos-arm64.zip

llama-server

The OpenAI-compatible server for API access.

Basic Usage

# Start server with a model
./llama-server \
  -m /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 8192   # context length

# Fuller example: all layers on GPU, 2 parallel slots, continuous batching, Prometheus metrics
./llama-server \
  -m /tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 8192 \
  -ngl 99 \
  --threads 8 \
  --parallel 2 \
  --cont-batching \
  --metrics
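
With --metrics enabled, the server exposes Prometheus-format metrics that can be scraped directly:

# Scrape request/slot metrics for monitoring
curl http://localhost:8080/metrics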

Key Parameters

Parameter         Description                Default
-m                Model path                 Required
-c                Context length             2048
-ngl              GPU layers (99 for all)    0 (CPU)
--threads         CPU threads                Auto
--parallel        Concurrent slots           1
--host            Listen address             127.0.0.1
--port            Listen port                8080
--cont-batching   Continuous batching        Off
--flash-attn      Flash attention            Off

GPU Layer Allocation

Control memory usage with -ngl:

# Full GPU (128GB system, 70B Q4)
-ngl 99  # All layers on GPU

# Partial offload (limited memory)
-ngl 40  # 40 layers on GPU, rest on CPU

# CPU only
-ngl 0
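
To confirm how many layers actually landed on the GPU, check the startup log (exact wording varies by version, but it reports the offloaded layer count):

# One-shot check with llama-cli; look for "offloaded N/M layers to GPU"
./llama-cli -m model.gguf -ngl 40 -p "test" -n 1 2>&1 | grep -i "offloaded"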

API Endpoints

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain recursion."}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

# Text completion
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 20}'

# List models
curl http://localhost:8080/v1/models

# Health check
curl http://localhost:8080/health
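
For scripting, jq makes it easy to pull just the generated text out of a response (assumes jq is installed):

# Extract only the assistant message from a chat completion
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain recursion."}]}' \
  | jq -r '.choices[0].message.content'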

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
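
The stream arrives as Server-Sent Events: each chunk is a "data: {json}" line and the stream ends with "data: [DONE]". A rough way to print just the token deltas from the shell (assumes jq and GNU sed; a client library is more convenient in practice):

# Print streamed tokens as they arrive (-u keeps sed unbuffered)
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "stream": true}' \
  | sed -un 's/^data: //p' \
  | grep --line-buffered -v '^\[DONE\]' \
  | jq -rj '.choices[0].delta.content // empty'
echo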

Configuration File

Create a server config for complex setups:

{
  "model": "/tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf",
  "host": "0.0.0.0",
  "port": 8080,
  "ctx_size": 8192,
  "n_gpu_layers": 99,
  "threads": 8,
  "parallel": 2,
  "cont_batching": true,
  "flash_attn": true
}

# Start the server with the config file
./llama-server --config config.json

Performance Tuning

Context Length vs Memory

Context   Memory Impact   Use Case
2048      Baseline        Short prompts
4096      +~1GB           Standard chat
8192      +~2GB           Coding tasks
32768     +~8GB           Long documents
131072    +~32GB          Full context models
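
The memory impact is dominated by the KV cache, which grows linearly with context. A rough estimate, assuming a Llama-3.3-70B-style shape (80 layers, 8 KV heads with GQA, head dimension 128, F16 cache):

# KV cache bytes ≈ 2 (K and V) x layers x context x kv_heads x head_dim x 2 bytes (F16)
echo $(( 2 * 80 * 8192 * 8 * 128 * 2 ))   # ~2.7 GB at 8192 context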

Flash Attention

Reduces memory usage for long contexts:

./llama-server -m model.gguf --flash-attn -c 32768

Requires a backend with flash attention support (the CUDA and Metal builds include it).
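
To confirm it is actually in effect, check the startup log (the exact message varies by version and backend):

# Look for a flash attention line in the startup output
./llama-server -m model.gguf --flash-attn -c 32768 --verbose 2>&1 | grep -i "flash"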

Multi-Model Serving

Run multiple instances on different ports:

# Terminal 1 - Code model
./llama-server -m deepseek-coder-33b-q4.gguf --port 8081

# Terminal 2 - Chat model
./llama-server -m llama-3.3-70b-q4.gguf --port 8082

Use a reverse proxy to route requests. See Load Balancing.
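
A quick way to confirm each instance is serving the model you expect (assumes jq):

# Each port should report its own model
curl -s http://localhost:8081/v1/models | jq -r '.data[].id'
curl -s http://localhost:8082/v1/models | jq -r '.data[].id'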

Benchmarking

Use llama-bench to measure performance:

# 512-token prompt, 128 generated tokens, all layers on GPU
./llama-bench \
  -m model.gguf \
  -p 512 \
  -n 128 \
  -ngl 99

# Output shows tokens/sec for prompt processing (pp) and generation (tg)

See Benchmarking for methodology.
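
llama-bench also accepts comma-separated values to sweep a parameter in one run, which is handy for comparing offload settings (check ./llama-bench --help for the output formats your build supports):

# Compare CPU-only, partial, and full offload in a single run, printed as a markdown table
./llama-bench -m model.gguf -p 512 -n 128 -ngl 0,40,99 -o md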

systemd Service

Run llama-server as a service:

# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp Server
After=network.target

[Service]
Type=simple
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m /tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 8192 \
  -ngl 99
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# Load the unit and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
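
Check status and follow the logs with the usual systemd tooling:

# Verify the service is running and watch its output
systemctl status llama-server
journalctl -u llama-server -f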

Container Usage

See llama.cpp Docker for containerized deployment.

Quick start:

docker run --gpus all -p 8080:8080 \
  -v /tank/ai/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/gguf/llama-3.3-70b-q4.gguf \
  -c 4096 -ngl 99 --host 0.0.0.0
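
A CPU-only server image is also published; the invocation is the same minus the GPU options:

# CPU-only container (no --gpus or -ngl needed)
docker run -p 8080:8080 \
  -v /tank/ai/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/gguf/llama-3.3-70b-q4.gguf \
  -c 4096 --host 0.0.0.0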

Troubleshooting

Model Won't Load

# Check available memory
free -h  # Linux
memory_pressure  # macOS

# Reduce GPU layers
-ngl 30  # Instead of 99

# Use smaller quantization
# Q4_K_M instead of Q6_K

Slow Generation

# Verify GPU is being used
# Look for "offloaded N/M layers to GPU" in the startup log

# Check Metal is enabled (macOS)
./llama-server -m model.gguf --verbose 2>&1 | grep -i metal

# Reduce context if memory-bound
-c 4096  # Instead of 32768

API Connection Refused

# Bind to all interfaces
--host 0.0.0.0

# Check firewall
sudo ufw allow 8080/tcp

See Also