# Inference Engines
Compare and choose the right inference engine for your local LLM deployment.
## Engine Comparison
| Engine | Best For | GPU Support | API | Speed |
|---|---|---|---|---|
| llama.cpp | Flexibility, wide model support | Metal, CUDA, Vulkan | OpenAI-compat | Good |
| Ollama | Ease of use, container deployment | Metal, CUDA | OpenAI-compat | Good |
| MLX | Apple Silicon maximum performance | Metal only | Python/REST | Excellent |
| vLLM | High-throughput serving | CUDA (NVIDIA) | OpenAI-compat | Excellent |
## Quick Selection Guide
```
┌────────────────────────────────────────────────────────────┐
│                     Choose Your Engine                      │
└────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │Apple Silicon│     │   NVIDIA    │     │   Server    │
   │   Desktop   │     │     GPU     │     │ (Multi-GPU) │
   └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
          │                   │                   │
          ▼                   ▼                   ▼
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │     MLX     │     │  llama.cpp  │     │    vLLM     │
   │  (fastest)  │     │  or Ollama  │     │  (batched)  │
   └─────────────┘     └─────────────┘     └─────────────┘
          │                   │
          ▼                   ▼
   ┌─────────────┐
   │   Ollama    │     Want Docker-like UX?  → Ollama
   │  (easy UX)  │     Need max flexibility? → llama.cpp
   └─────────────┘
```
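
The same decision flow, expressed as a small helper for illustration only (the `choose_engine` function and its hardware/priority labels are invented for this sketch, not part of any engine's tooling):

```python
# Illustrative sketch of the selection logic above; the labels are made up.
def choose_engine(hardware: str, priority: str = "ease") -> str:
    if hardware == "apple_silicon":
        # MLX squeezes out the most tokens/sec; Ollama trades a little
        # speed for a much simpler workflow.
        return "MLX" if priority == "speed" else "Ollama or llama.cpp"
    if hardware == "nvidia_gpu":
        return "llama.cpp" if priority == "flexibility" else "Ollama"
    if hardware == "multi_gpu_server":
        # Continuous batching makes vLLM the default for shared, high-throughput serving.
        return "vLLM"
    return "llama.cpp"  # broadest hardware support as a fallback


print(choose_engine("apple_silicon", priority="speed"))  # -> MLX
```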
## Feature Matrix
| Feature | llama.cpp | Ollama | MLX | vLLM |
|---|---|---|---|---|
| OpenAI API | Yes | Yes | Via wrapper | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| Batching | Basic | Basic | Basic | Advanced |
| GGUF support | Native | Native | Via convert | No |
| Safetensors | Via convert | Via convert | Native | Native |
| Model discovery | Manual | Built-in | Manual | Manual |
| GPU memory mgmt | Manual | Auto | Auto | Auto |
| Multi-model | Yes | Yes | Yes | Yes |
| Speculative decode | Yes | No | Yes | Yes |
| Continuous batching | No | No | No | Yes |
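
Because every engine in the matrix sits behind the same OpenAI-style API, a single client snippet covers all four. Below is a minimal streaming sketch using the `openai` Python package; the base URL, placeholder API key, and model name are assumptions to adjust for whichever engine you run:

```python
# Streams tokens from any OpenAI-compatible local server.
# Assumes `pip install openai` and a server listening on localhost:8080.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="llama3.3",  # placeholder; use whatever name your engine exposes
    messages=[{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```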
## Performance Comparison
Benchmark on Apple Silicon M4 Max (128GB), Llama 3.3 70B Q4:
| Engine | Tokens/sec | Time to First Token | Notes |
|---|---|---|---|
| MLX | ~45 | ~100ms | Apple Silicon optimized |
| llama.cpp (Metal) | ~35 | ~150ms | Good all-rounder |
| Ollama | ~35 | ~200ms | Overhead for convenience |
Benchmark on NVIDIA RTX 4090, Llama 3.3 70B Q4:
| Engine | Tokens/sec | Notes |
|---|---|---|
| vLLM | ~80+ | Batched requests |
| llama.cpp (CUDA) | ~50 | Single request |
| Ollama | ~48 | Single request |
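
Treat these figures as ballpark numbers; quantization, context length, and sampling settings all shift them, so it is worth re-measuring on your own hardware. Below is a rough sketch of how tokens/sec and time-to-first-token can be measured against any OpenAI-compatible endpoint (the URL, model name, and the chunk-counting shortcut are assumptions):

```python
# Crude latency/throughput probe for an OpenAI-compatible streaming endpoint.
# Counts streamed chunks as a proxy for tokens, which is approximately right
# for local servers that emit one token per chunk.
import json
import time

import requests

BASE_URL = "http://localhost:8080/v1"  # adjust per engine
MODEL = "llama3.3"                     # adjust per engine

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write 200 words about benchmarks."}],
    "stream": True,
    "max_tokens": 256,
}

start = time.perf_counter()
first_token_at = None
chunks = 0

with requests.post(f"{BASE_URL}/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1

end = time.perf_counter()
if first_token_at and chunks > 1:
    print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
    print(f"~{(chunks - 1) / (end - first_token_at):.1f} tokens/sec (decode rate)")
```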
## API Compatibility
All four engines expose OpenAI-compatible endpoints (MLX via a wrapper, as noted in the feature matrix):
```bash
# Works with any engine
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Standard endpoints:

- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion
- `GET /v1/models` - List models
- `POST /v1/embeddings` - Generate embeddings (some engines)
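
The same endpoints can also be driven from Python by pointing the `openai` package at the local server; the model name and dummy API key below are placeholders, and `/v1/embeddings` only responds on engines and models that support it:

```python
# Walks the standard endpoints against a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# GET /v1/models - see what the server has loaded or can discover
for model in client.models.list():
    print(model.id)

# POST /v1/chat/completions - same request as the curl example above
response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```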
## Installation Summary

### Container Availability
| Engine | Official Image | GPU Support |
|---|---|---|
| llama.cpp | ghcr.io/ggml-org/llama.cpp:server | CUDA, ROCm |
| Ollama | ollama/ollama | CUDA, ROCm |
| MLX | N/A (native only) | Metal (native) |
| vLLM | vllm/vllm-openai | CUDA |
See Container Deployment for detailed Docker setups.
## Memory Requirements
Same model, different engines (70B Q4):
| Engine | Base Memory | With 8K Context | Notes |
|---|---|---|---|
| MLX | ~40GB | ~42GB | Efficient |
| llama.cpp | ~43GB | ~45GB | Slightly higher |
| Ollama | ~43GB | ~45GB | Uses llama.cpp |
| vLLM | ~50GB+ | ~55GB+ | Pre-allocates KV cache memory |
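
As a rough sanity check on the table: a Q4-class quantization stores on the order of 4.5 bits per weight, and the KV cache adds a few more gigabytes at 8K context. The sketch below assumes Llama-70B-style shapes (80 layers, 8 KV heads, head size 128, fp16 cache) and is only a back-of-the-envelope estimate, not how any engine actually allocates memory:

```python
# Rule-of-thumb memory estimate: quantized weights + KV cache.
def estimate_memory_gb(
    params_b: float,
    bits_per_weight: float = 4.5,  # Q4-class quants land around here
    context: int = 8192,
    n_layers: int = 80,            # Llama 70B-style shape (assumed)
    n_kv_heads: int = 8,
    head_dim: int = 128,
    kv_bytes: int = 2,             # fp16 K and V entries
) -> float:
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    # K and V, per layer, per KV head, per position
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context * kv_bytes / 1e9
    return weights_gb + kv_gb


print(f"~{estimate_memory_gb(70):.0f} GB")  # ≈ 42 GB, in line with the table
```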
## Choosing Based on Use Case

### Local Development

Recommended: Ollama

- Easy model management
- Quick to get started
- Good enough performance

### Production API

Recommended: vLLM (NVIDIA) or llama.cpp (Apple Silicon)

- vLLM for high throughput
- llama.cpp for stability and flexibility

### Maximum Performance

Recommended: MLX (Apple Silicon) or vLLM (NVIDIA)

- Best tokens/sec
- Optimized for their platforms

### Container Deployment

Recommended: Ollama or llama.cpp

- Good Docker support
- GPU passthrough options