vLLM¶

Not currently usable on the MS-S1 MAX

vLLM's ROCm support targets CDNA/Instinct (gfx9xx) and a small set of RDNA3 dGPUs (gfx1100). The Strix Halo iGPU (gfx1151) is not on the supported target list as of ROCm 7.x — vLLM will either fail to build kernels for it or crash at model load. This page is kept as reference only; on this hardware use llama.cpp HIP or Ollama. Re-evaluate vLLM if and when AMD ships official gfx1151 kernels upstream.

High-throughput LLM serving engine with PagedAttention and continuous batching. Listed here so readers comparing engines understand what vLLM offers and why it isn't selected for this build.

Overview¶

vLLM provides:

14-24x throughput vs HuggingFace Transformers
PagedAttention - Efficient KV cache memory management
Continuous batching - Dynamic request scheduling
OpenAI-compatible API - Drop-in replacement
Multi-GPU support - Tensor and pipeline parallelism

Hardware Support (reference only)¶

vLLM is GPU-first and primarily targets datacenter accelerators:

Backend	Status	Notes
AMD ROCm — CDNA / Instinct (`gfx9xx`)	Supported	MI200/MI300 class
AMD ROCm — RDNA3 dGPU (`gfx1100`)	Supported	RX 7900 XTX and similar
AMD ROCm — RDNA3.5 iGPU (`gfx1151`, Strix Halo)	Not supported	The MS-S1 MAX falls here
Apple Silicon (Metal / MLX)	Not supported	Use MLX or llama.cpp instead
CPU-only	Experimental	Not a production path

For the MS-S1 MAX specifically, the supported alternatives that cover the same use cases are:

High concurrency / OpenAI-compatible serving -> llama.cpp llama-server with --cont-batching --parallel N
Easy model management -> Ollama (ROCm image)
Multimodal / extra endpoints -> LocalAI (localai/localai:latest-gpu-hipblas)

Quick Start (reference)¶

The commands below show what vLLM usage looks like on supported hardware. They are not expected to work on the MS-S1 MAX.

Command Line¶

# Start OpenAI-compatible server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Python API¶

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Key Server Parameters¶

Parameter	Description	Default
`--model`	Model name or path	Required
`--host`	Listen address	`localhost`
`--port`	Listen port	`8000`
`--tensor-parallel-size`	GPUs for tensor parallelism	1
`--pipeline-parallel-size`	GPUs for pipeline parallelism	1
`--max-model-len`	Maximum context length	Model default
`--gpu-memory-utilization`	GPU memory fraction	0.9
`--quantization`	Quantization method	None
`--dtype`	Data type	`auto`

Concepts Worth Knowing¶

Even though vLLM isn't deployed on this build, two of its concepts are useful context for tuning llama.cpp's llama-server:

Continuous Batching¶

vLLM popularized continuous batching — admit and retire requests at every decode step rather than waiting for a full batch to finish. llama-server --cont-batching --parallel N is llama.cpp's implementation of the same idea. See API Serving.

PagedAttention¶

Paged KV-cache allocation — store the per-request KV cache in fixed-size pages rather than one contiguous block. This is what makes vLLM's memory utilization so high. llama.cpp does not have a direct equivalent today; budget KV cache conservatively (see Memory Management).

Traditional: Contiguous memory allocation
+-------------------------------------+
| Request 1 KV Cache (wasted space)   |
+-------------------------------------+
| Request 2 KV Cache                  |
+-------------------------------------+

PagedAttention: Paged memory blocks
+----+----+----+----+----+----+----+----+
| R1 | R2 | R1 | R3 | R2 | R1 | R3 | R2 |
+----+----+----+----+----+----+----+----+

Comparison with Alternatives Used on the MS-S1 MAX¶

Feature	vLLM (not used)	llama.cpp (HIP)	Ollama (ROCm)
Runs on `gfx1151` today	No	Yes	Yes
Throughput on supported HW	Highest	Good	Good
Batching	Continuous + paged	Continuous	Continuous
GPU backends	ROCm (CDNA/gfx1100), CUDA	ROCm/HIP, Metal, Vulkan, CUDA	ROCm, Metal, CUDA
Setup	Medium	Easy	Easiest
Apple Silicon	No	Yes (Metal)	Yes (Metal)

When gfx1151 lands in vLLM upstream, the candidate path will be the official ROCm image with devices: /dev/kfd, /dev/dri, group_add: [video, render], and the usual HSA_OVERRIDE_GFX_VERSION=11.5.1 override if needed. Until then, do not deploy.