Inference Engines¶
Compare and choose the right inference engine for your local LLM deployment on the MS-S1 MAX.
Recommendation for this hardware¶
For Strix Halo (AMD gfx1151) the practical choices are llama.cpp built with HIP and Ollama (which uses llama.cpp under the hood). Everything else either doesn't run on this GPU or doesn't run on Linux:
| Engine | Strix Halo / gfx1151 | Notes |
|---|---|---|
| llama.cpp (HIP) | Yes — recommended | Build with cmake -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151. Best perf and most flexibility. |
| Ollama | Yes — recommended | Auto-detects ROCm on install. Easiest UX. |
| MLX | No — Apple Silicon only | Reference for Mac clients only. |
| vLLM | No — gfx1151 not in supported targets | Re-evaluate when AMD ships official kernels. |
Engine Comparison¶
| Engine | Best For | GPU support (this site) | API | Speed |
|---|---|---|---|---|
| llama.cpp | Flexibility, max perf on Strix Halo | HIP/ROCm + Vulkan + Metal (Mac) | OpenAI-compat | Good |
| Ollama | Ease of use, model library | ROCm (Linux) + Metal (Mac) | OpenAI-compat | Good |
| MLX | Apple Silicon only | Metal only | Python/REST | Excellent (on Mac) |
| vLLM | High-throughput when supported | ROCm CDNA / gfx1100 only — not gfx1151 | OpenAI-compat | Excellent (when supported) |
Feature Matrix¶
| Feature | llama.cpp | Ollama | MLX | vLLM |
|---|---|---|---|---|
| Runs on MS-S1 MAX | Yes | Yes | No | No |
| OpenAI API | Yes | Yes | Via wrapper | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| Batching | Basic / --parallel | Basic | Basic | Advanced |
| GGUF support | Native | Native | Via convert | No |
| Safetensors | Via convert | Via convert | Native | Native |
| Model library / pull | Manual | Built-in | Manual | Manual |
| GPU memory mgmt | Manual | Auto | Auto | Auto |
| Multi-model | Yes (multiple server instances) | Yes (LRU swap) | Yes | Yes |
| Speculative decode | Yes | No | Yes | Yes |
| Continuous batching | No | No | No | Yes |
Performance — Strix Halo (single-stream, ROCm/HIP)¶
Approximate tokens/sec on the MS-S1 MAX after amd-ttm --set 108:
| Model / quant | llama-server | Ollama |
|---|---|---|
| 8B Q4_K_M | ~55-70 | ~50-65 |
| 32B Q4_K_M | ~16-22 | ~15-20 |
| 70B Q4_K_M | ~7-9 | ~6-8 |
| 70B Q6_K | ~4-6 | ~4-5 |
| 70B Q8_0 | ~3-5 | ~3-4 |
| 405B IQ2 | ~1-2 | ~1-2 |
Ollama runs ~5-15% slower than a direct llama-server build because it ships its own llama.cpp binary and adds a small abstraction layer. Worth it for the model-pull/swap UX unless you're squeezing every token.
Reference numbers from other platforms¶
| Platform | Engine | Llama 3.3 70B Q4 |
|---|---|---|
| Apple M4 Max 128GB | MLX | ~30-45 tok/s |
| Apple M4 Max 128GB | llama.cpp (Metal) | ~25-35 tok/s |
| MS-S1 MAX (gfx1151) | llama.cpp (HIP) | ~7-9 tok/s |
The MS-S1 MAX is slower per token on small models but can run far larger models without offloading because of its 128GB unified-memory pool. For context, discrete-GPU rigs with ~24GB of VRAM (RTX 4090 class) typically hit ~45-50 tok/s on a 70B Q4 — much faster on small models, but unable to fit larger weights at all without spilling to system RAM.
API Compatibility¶
All engines provide OpenAI-compatible endpoints:
# Works with any engine
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3",
"messages": [{"role": "user", "content": "Hello"}]
}'
Standard endpoints: - POST /v1/chat/completions - Chat completion - POST /v1/completions - Text completion - GET /v1/models - List models - POST /v1/embeddings - Generate embeddings (some engines)
Installation Summary¶
Container Availability¶
| Engine | Official image (this build) | Notes |
|---|---|---|
| llama.cpp | ghcr.io/ggml-org/llama.cpp:server-rocm | ROCm variant for MS-S1 MAX; :server-vulkan as fallback |
| Ollama | ollama/ollama:rocm | ROCm variant; default :latest is CUDA and unused here |
| MLX | N/A (native only) | Metal — run natively on Mac |
| vLLM | vllm/vllm-openai | Not used — does not support gfx1151 today |
See Container Deployment for detailed Docker setups.
Memory Requirements¶
Same model, different engines (70B Q4):
| Engine | Base Memory | With 8K Context | Notes |
|---|---|---|---|
| MLX | ~40GB | ~42GB | Efficient |
| llama.cpp | ~43GB | ~45GB | Slightly higher |
| Ollama | ~43GB | ~45GB | Uses llama.cpp |
| vLLM | ~50GB+ | ~55GB+ | Higher for features |
Choosing Based on Use Case¶
Local Development¶
Recommended: Ollama - Easy model management - Quick to get started - Good enough performance
Production API¶
Recommended on the MS-S1 MAX: llama.cpp (HIP) behind Ollama or a thin proxy - Best stability + flexibility on gfx1151 - Use --parallel to allow concurrent slots
Maximum Performance¶
MS-S1 MAX: llama.cpp (HIP) tuned with --flash-attn --cont-batching Apple Silicon laptop: MLX - Pick the engine that matches the hardware; cross-platform comparisons rarely beat the platform-native option.
Container Deployment¶
Recommended: Ollama or llama.cpp - Good Docker support - GPU passthrough options