Inference Engines

Compare and choose the right inference engine for your local LLM deployment on the MS-S1 MAX.

Recommendation for this hardware

For Strix Halo (AMD gfx1151) the practical choices are llama.cpp built with HIP and Ollama (which uses llama.cpp under the hood). Everything else either doesn't run on this GPU or doesn't run on Linux:

| Engine | Strix Halo / gfx1151 | Notes |
|---|---|---|
| llama.cpp (HIP) | Yes (recommended) | Build with cmake -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151. Best performance and most flexibility. |
| Ollama | Yes (recommended) | Auto-detects ROCm on install. Easiest UX. |
| MLX | No (Apple Silicon only) | Reference for Mac clients only. |
| vLLM | No (gfx1151 not in supported targets) | Re-evaluate when AMD ships official kernels. |
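
Before building, it can help to confirm which GPU target ROCm actually reports. The following is a minimal sketch, not part of either engine: it assumes the ROCm `rocminfo` tool is installed and that its output names the target as `gfx1151`. The function names `reports_target` and `detect_target` are hypothetical.

```python
import re
import shutil
import subprocess


def reports_target(rocminfo_output: str, target: str = "gfx1151") -> bool:
    """Return True if the gfx target appears as a whole word in the output."""
    return re.search(rf"\b{re.escape(target)}\b", rocminfo_output) is not None


def detect_target(target: str = "gfx1151") -> bool:
    """Run rocminfo (assumed to be on PATH) and look for the target name."""
    if shutil.which("rocminfo") is None:
        return False  # ROCm tools are not installed or not on PATH
    result = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=False
    )
    return reports_target(result.stdout, target)
```

Calling `detect_target()` on the MS-S1 MAX should return True once ROCm is set up. A False result usually points at a missing ROCm install rather than at the engine.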

Engine Comparison

| Engine | Best for | GPU support (this site) | API | Speed |
|---|---|---|---|---|
| llama.cpp | Flexibility, max perf on Strix Halo | HIP/ROCm + Vulkan + Metal (Mac) | OpenAI-compat | Good |
| Ollama | Ease of use, model library | ROCm (Linux) + Metal (Mac) | OpenAI-compat | Good |
| MLX | Apple Silicon only | Metal only | Python/REST | Excellent (on Mac) |
| vLLM | High throughput when supported | ROCm CDNA / gfx1100 only; not gfx1151 | OpenAI-compat | Excellent (when supported) |

Feature Matrix

| Feature | llama.cpp | Ollama | MLX | vLLM |
|---|---|---|---|---|
| Runs on MS-S1 MAX | Yes | Yes | No | No |
| OpenAI API | Yes | Yes | Via wrapper | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| Batching | Basic / --parallel | Basic | Basic | Advanced |
| GGUF support | Native | Native | Via convert | No |
| Safetensors | Via convert | Via convert | Native | Native |
| Model library / pull | Manual | Built-in | Manual | Manual |
| GPU memory mgmt | Manual | Auto | Auto | Auto |
| Multi-model | Yes (multiple server instances) | Yes (LRU swap) | Yes | Yes |
| Speculative decode | Yes | No | Yes | Yes |
| Continuous batching | Yes (--cont-batching) | No | No | Yes |
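
The Streaming row can be made concrete with a small parser. This is a sketch that assumes the engine follows the OpenAI streaming convention when the request sets "stream": true: server-sent event lines prefixed with `data:`, incremental text in `choices[].delta.content`, and a final `data: [DONE]` sentinel. The helper name `iter_stream_text` and the sample lines are illustrative, not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

In practice `lines` would be the decoded lines of the HTTP response body, read as they arrive.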

Performance — Strix Halo (single-stream, ROCm/HIP)

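Approximate single-stream generation speed in tokens/sec on the MS-S1 MAX after running amd-ttm --set 108 (which raises the GPU-addressable memory limit to 108 GB):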

| Model / quant | llama-server (tok/s) | Ollama (tok/s) |
|---|---|---|
| 8B Q4_K_M | ~55-70 | ~50-65 |
| 32B Q4_K_M | ~16-22 | ~15-20 |
| 70B Q4_K_M | ~7-9 | ~6-8 |
| 70B Q6_K | ~4-6 | ~4-5 |
| 70B Q8_0 | ~3-5 | ~3-4 |
| 405B IQ2 | ~1-2 | ~1-2 |

Ollama runs ~5-15% slower than a direct llama-server build because it ships its own llama.cpp binary and adds a small abstraction layer. Worth it for the model-pull/swap UX unless you're squeezing every token.
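
To reproduce this kind of comparison on your own hardware, a rough measurement can be built on the `usage.completion_tokens` field that OpenAI-compatible responses include. The sketch below makes several assumptions: the function names are hypothetical, `send` is any callable you supply that performs one request and returns the parsed JSON response, and wall-clock timing includes prompt processing, so it slightly understates pure generation speed.

```python
import time
from typing import Callable


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by elapsed seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def relative_slowdown(baseline_tps: float, other_tps: float) -> float:
    """Fraction by which other_tps falls short of baseline_tps."""
    return (baseline_tps - other_tps) / baseline_tps


def time_request(send: Callable[[], dict]) -> float:
    """Time one request; `send` must return a dict with usage.completion_tokens."""
    start = time.perf_counter()
    response = send()
    elapsed = time.perf_counter() - start
    return tokens_per_second(response["usage"]["completion_tokens"], elapsed)


# Upper bounds of the 8B Q4_K_M row: 70 tok/s (llama-server) vs 65 tok/s (Ollama).
print(f"{relative_slowdown(70, 65):.1%}")  # prints "7.1%"
```

Averaging several runs with the same prompt and output length gives steadier numbers than a single request.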

Reference numbers from other platforms

| Platform | Engine | Llama 3.3 70B Q4 |
|---|---|---|
| Apple M4 Max 128GB | MLX | ~30-45 tok/s |
| Apple M4 Max 128GB | llama.cpp (Metal) | ~25-35 tok/s |
| MS-S1 MAX (gfx1151) | llama.cpp (HIP) | ~7-9 tok/s |

The MS-S1 MAX is slower per token than the Apple hardware above, but its 128GB unified-memory pool lets it run far larger models without offloading. For context, discrete-GPU rigs with ~24GB of VRAM (RTX 4090 class) are much faster on models that fit entirely in VRAM, but a 70B Q4 (~43GB of weights) does not fit on a single such card: it has to be split across several GPUs or spilled to system RAM.
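
The fit-or-spill question comes down to simple arithmetic. The sketch below estimates weight size from parameter count and bits per weight, then compares it against a memory budget. All inputs are assumptions: 70.6 billion parameters for a Llama 70B model, an average of about 4.85 bits per weight for Q4_K_M (a mixed scheme, so this is approximate), a nominal 2 GB of headroom for context, and the function names themselves.

```python
# Approximate bits per weight for common GGUF quantizations.
# Q8_0 and Q6_K follow from their block layouts; Q4_K_M mixes
# several block types, so 4.85 is an assumed average.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.5625, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated weight size in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, budget_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus a nominal context overhead fit in the budget."""
    return weights_gb(params_billion, quant) + overhead_gb <= budget_gb


print(f"70B Q4_K_M weights: ~{weights_gb(70.6, 'Q4_K_M'):.1f} GB")
for budget in (24, 108):  # one 24GB card vs. a 108GB GPU-addressable pool
    print(budget, fits(70.6, "Q4_K_M", budget))
```

With these inputs the estimate lands at about 42.8 GB, which fails the 24 GB check and passes the 108 GB one.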

API Compatibility

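llama.cpp, Ollama, and vLLM expose OpenAI-compatible endpoints; MLX needs a wrapper (see the Feature Matrix). The request below targets llama-server's default port 8080. Ollama listens on 11434 and vLLM on 8000 by default.

# Same request shape for any OpenAI-compatible engine; adjust host and port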
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Standard endpoints:

  • POST /v1/chat/completions - Chat completion
  • POST /v1/completions - Text completion
  • GET /v1/models - List models
  • POST /v1/embeddings - Generate embeddings (some engines)
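
The same chat request can be issued from Python using only the standard library. This is a minimal sketch: the base URLs assume each engine is running on its default port on localhost, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Default listen addresses; change these if you started a server elsewhere.
DEFAULT_BASE_URLS = {
    "llama.cpp": "http://localhost:8080",
    "ollama": "http://localhost:11434",
    "vllm": "http://localhost:8000",
}


def build_chat_request(engine: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        DEFAULT_BASE_URLS[engine] + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(engine: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(engine, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

For example, `chat("ollama", "llama3.3", "Hello")` sends the curl request above to a local Ollama instance. Separating request construction from sending keeps the payload easy to inspect without a running server.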

Installation Summary

llama.cpp

# Build from source (macOS, Metal)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# On the MS-S1 MAX (Linux, ROCm), configure with HIP instead:
# cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151

# Run server
./build/bin/llama-server -m model.gguf -c 4096

Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start and run
ollama serve
ollama run llama3.3

MLX

# Requires Apple Silicon
pip install mlx-lm

# Run inference
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit

vLLM

# vLLM does not support the MS-S1 MAX (gfx1151) today.
# Listed here for reference only: it works on ROCm CDNA / gfx1100 and
# other supported GPUs. Re-evaluate when AMD ships official gfx1151
# kernels.
pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct

Container Availability

| Engine | Official image (this build) | Notes |
|---|---|---|
| llama.cpp | ghcr.io/ggml-org/llama.cpp:server-rocm | ROCm variant for MS-S1 MAX; :server-vulkan as fallback |
| Ollama | ollama/ollama:rocm | ROCm variant; default :latest is CUDA and unused here |
| MLX | N/A (native only) | Metal; run natively on Mac |
| vLLM | vllm/vllm-openai | Not used; does not support gfx1151 today |

See Container Deployment for detailed Docker setups.
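
As a quick orientation, the sketch below assembles `docker run` command lines for the two ROCm images in the table. It rests on assumptions: ROCm containers are given access to `/dev/kfd` and `/dev/dri`, each server listens on its default port, and the model path and container names are placeholders. Some hosts also need extra group or security options that are not shown here. `rocm_docker_args` is a hypothetical helper.

```python
import shlex
from typing import List, Sequence


def rocm_docker_args(image: str, name: str, port: int,
                     volumes: Sequence[str] = (),
                     server_args: Sequence[str] = ()) -> List[str]:
    """Build a docker run argv that passes the AMD GPU devices through."""
    args = [
        "docker", "run", "-d", "--name", name,
        "--device", "/dev/kfd", "--device", "/dev/dri",
        "-p", f"{port}:{port}",
    ]
    for volume in volumes:
        args += ["-v", volume]
    # Anything after the image name is handed to the container entrypoint.
    return args + [image, *server_args]


ollama_cmd = rocm_docker_args(
    "ollama/ollama:rocm", "ollama", 11434,
    volumes=["ollama:/root/.ollama"],
)
llama_cmd = rocm_docker_args(
    "ghcr.io/ggml-org/llama.cpp:server-rocm", "llama-server", 8080,
    volumes=["/path/to/models:/models"],  # placeholder host path
    server_args=["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"],
)
print(shlex.join(ollama_cmd))
print(shlex.join(llama_cmd))
```

The printed lines can be pasted into a shell, or the lists can be passed to `subprocess.run` directly.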

Memory Requirements

Same model, different engines (70B Q4):

| Engine | Base memory | With 8K context | Notes |
|---|---|---|---|
| MLX | ~40GB | ~42GB | Efficient |
| llama.cpp | ~43GB | ~45GB | Slightly higher |
| Ollama | ~43GB | ~45GB | Uses llama.cpp |
| vLLM | ~50GB+ | ~55GB+ | Higher for features |
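
The roughly 2GB gap between base memory and the 8K-context figure is mostly the KV cache. A worked estimate follows, assuming the Llama 70B architecture (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an unquantized 16-bit cache. Engines that quantize the cache or allocate extra buffers will differ, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for n_tokens of context."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "2.50 GiB"
```

The cache grows linearly with context length, so the same model at 32K context needs about four times as much, around 10 GiB.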

Choosing Based on Use Case

Local Development

Recommended: Ollama

  • Easy model management
  • Quick to get started
  • Good enough performance

Production API

Recommended on the MS-S1 MAX: llama.cpp (HIP), either through Ollama or as llama-server behind a thin proxy

  • Best stability and flexibility on gfx1151
  • Use --parallel to allow concurrent slots
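
On the client side, concurrent slots only pay off if several requests are actually in flight at once. The sketch below bounds client concurrency to the number of server slots. The names are hypothetical, and `send` stands for whatever function performs one request against your server; a stand-in is used here so the example runs without one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_concurrently(send: Callable[[str], str], prompts: Sequence[str],
                     slots: int) -> List[str]:
    """Send prompts with at most `slots` requests in flight.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=slots) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a real HTTP call; upper-cases the prompt."""
    return prompt.upper()


print(run_concurrently(fake_send, ["a", "b", "c"], slots=2))  # ['A', 'B', 'C']
```

Matching `slots` to the value passed to --parallel avoids queueing extra requests on the server while still keeping every slot busy.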

Maximum Performance

  • MS-S1 MAX: llama.cpp (HIP) tuned with --flash-attn --cont-batching
  • Apple Silicon laptop: MLX

Pick the engine that matches the hardware; a cross-platform engine rarely beats the platform-native option.

Container Deployment

Recommended: Ollama or llama.cpp

  • Good Docker support
  • GPU passthrough options

See Also

  • llama.cpp - Detailed setup guide
  • Ollama - Docker-like LLM runner
  • MLX - Apple Silicon optimization
  • vLLM - High-throughput serving