vLLM¶
High-throughput LLM serving engine with PagedAttention and continuous batching for production deployments.
Overview¶
vLLM provides:
- Up to 24x higher throughput than HuggingFace Transformers
- PagedAttention - Efficient KV cache memory management
- Continuous batching - Dynamic request scheduling
- OpenAI-compatible API - Drop-in replacement
- Multi-GPU support - Tensor and pipeline parallelism
Requirements¶
- NVIDIA GPU - Primary support (CUDA 11.8+)
- AMD GPU - ROCm support (experimental)
- Linux - Primary platform
- Python 3.9+
Installation¶
pip Install¶
# Basic installation
pip install vllm
# With specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
Docker¶
docker run --gpus all \
-v /tank/ai/models/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct
From Source¶
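A typical source build, shown as a sketch (assumes a CUDA toolchain matching your installed PyTorch):
# Clone the repository and build in place
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # first build compiles the CUDA kernels and can take a while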
Quick Start¶
Command Line¶
# Start OpenAI-compatible server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--host 0.0.0.0 \
--port 8000
# With quantization
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--quantization awq
Python API¶
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Server Configuration¶
Basic Server¶
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 8192
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
| --model | Model name or path | Required |
| --host | Listen address | localhost |
| --port | Listen port | 8000 |
| --tensor-parallel-size | GPUs for tensor parallelism | 1 |
| --pipeline-parallel-size | GPUs for pipeline parallelism | 1 |
| --max-model-len | Maximum context length | Model default |
| --gpu-memory-utilization | GPU memory fraction | 0.9 |
| --quantization | Quantization method | None |
| --dtype | Data type | auto |
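The same knobs are available on the Python LLM constructor; argument names mirror the CLI flags with underscores:
from vllm import LLM

# Mirrors --tensor-parallel-size, --max-model-len,
# --gpu-memory-utilization, and --dtype
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=1,
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    dtype="auto",
)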
Multi-GPU Setup¶
# 2 GPUs with tensor parallelism (a 70B model in fp16 needs roughly 2x 80 GB)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95
# 4 GPUs with combined pipeline + tensor parallelism
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
API Usage¶
Chat Completion¶
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain Docker containers."}
],
"temperature": 0.7,
"max_tokens": 500
}'
Streaming¶
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Completions¶
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 20
}'
Python Client¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Required but not validated
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Write a haiku about coding"}
    ]
)
print(response.choices[0].message.content)
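Streaming works through the same client; a minimal sketch:
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)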
Quantization¶
AWQ¶
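AWQ requires a pre-quantized checkpoint; the model ID below is illustrative (any AWQ checkpoint from the HuggingFace Hub works):
vllm serve TheBloke/Llama-2-13B-chat-AWQ --quantization awq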
GPTQ¶
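The same applies to GPTQ checkpoints (model ID again illustrative):
vllm serve TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq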
Supported Quantization¶
| Method | Memory Savings | Quality | Notes |
|---|---|---|---|
| AWQ | ~75% | Good | Recommended for vLLM |
| GPTQ | ~75% | Good | Wide model availability |
| SqueezeLLM | ~75% | Good | Fewer pre-quantized checkpoints available |
| FP8 | ~50% | Excellent | Requires Hopper (H100) or Ada (RTX 40 series) |
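Unlike AWQ and GPTQ, FP8 needs no pre-quantized checkpoint; on supported hardware, recent vLLM versions can quantize weights at load time:
vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization fp8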
Performance Features¶
Continuous Batching¶
Automatically batches concurrent requests:
# Multiple concurrent requests are scheduled into a single continuous batch
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")

async def generate(prompt):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    # Run 10 requests concurrently
    prompts = [f"Question {i}" for i in range(10)]
    return await asyncio.gather(*(generate(p) for p in prompts))

results = asyncio.run(main())
PagedAttention¶
Memory-efficient KV cache management:
Traditional: Contiguous memory allocation
┌─────────────────────────────────────┐
│ Request 1 KV Cache (wasted space) │
├─────────────────────────────────────┤
│ Request 2 KV Cache │
└─────────────────────────────────────┘
PagedAttention: Paged memory blocks
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │
└────┴────┴────┴────┴────┴────┴────┴────┘
Benefits:
- Near-zero memory waste
- More concurrent requests
- Better GPU utilization
Speculative Decoding¶
Use draft model for faster inference:
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5
Docker Deployment¶
docker-compose¶
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - /tank/ai/models/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --tensor-parallel-size 1
      --max-model-len 8192
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Health Checks¶
Add this under the vllm service in docker-compose.yml:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
Monitoring¶
Prometheus Metrics¶
# Prometheus metrics are exposed by default by the OpenAI-compatible server
vllm serve model
# Metrics available at
curl http://localhost:8000/metrics
Key metrics:
- vllm:num_requests_running - Active requests
- vllm:num_requests_waiting - Queued requests
- vllm:gpu_cache_usage_perc - KV cache utilization
- vllm:avg_prompt_throughput_toks_per_s - Input throughput
- vllm:avg_generation_throughput_toks_per_s - Output throughput
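A minimal Prometheus scrape config for this endpoint (job name and target are illustrative; /metrics is Prometheus's default metrics path):
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['localhost:8000']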
Logging¶
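vLLM uses standard Python logging. Verbosity is typically raised through the VLLM_LOGGING_LEVEL environment variable; check your version's docs for the full set of logging options:
# More verbose engine logs for debugging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve model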
Performance Tuning¶
Memory Optimization¶
# Increase GPU memory usage
vllm serve model --gpu-memory-utilization 0.95
# Reduce max context for more concurrent requests
vllm serve model --max-model-len 4096
Throughput Optimization¶
# Enable chunked prefill for better batching
vllm serve model --enable-chunked-prefill
# Tune block size
vllm serve model --block-size 32
Benchmarking¶
# Use the benchmark script from the vLLM repository
# (run it against an already-running server)
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
--model meta-llama/Llama-3.3-70B-Instruct \
--num-prompts 100 \
--request-rate 10
Troubleshooting¶
CUDA Out of Memory¶
# Reduce memory usage
--gpu-memory-utilization 0.8
--max-model-len 4096
# Use quantization
--quantization awq
Model Not Loading¶
# Check HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token
# Or use local path
vllm serve /tank/ai/models/huggingface/models--meta-llama--Llama-3.3-70B-Instruct/snapshots/...
Slow Startup¶
First startup is slow because model weights must be downloaded and loaded; subsequent starts read from the local HuggingFace cache (the /tank/ai/models/huggingface volume mounted in the Docker examples above).
Comparison with Alternatives¶
| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| Throughput | Highest | Medium | Medium |
| Batching | Continuous | Basic | Basic |
| GPU Support | NVIDIA/AMD | All | All |
| Setup | Medium | Easy | Easiest |
| Memory Efficiency | Excellent | Good | Good |
| Apple Silicon | No | Yes | Yes |
Use vLLM when:
- Running on NVIDIA GPUs
- You need high throughput
- Serving multiple concurrent users
See Also¶
- Inference Engines Index - Engine comparison
- Load Balancing - Multi-backend setup
- API Serving - Production deployment
- Benchmarking - Performance testing