MLX¶
Apple's machine learning framework optimized for Apple Silicon. On the same hardware it typically generates tokens 20-40% faster than llama.cpp (see the Performance Comparison below).
Overview¶
MLX provides:
- Apple Silicon native - Designed for M-series unified memory
- Metal acceleration - Full GPU utilization
- Lazy evaluation - Computation is deferred until a result is needed, keeping memory usage low (see the sketch after this list)
- NumPy-like API - Familiar Python interface
- Active development - Regular performance improvements
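Lazy evaluation in practice: operations only record a compute graph, and nothing runs on the GPU until a result is forced. A minimal sketch using the core API:
import mlx.core as mx
# No computation happens here; MLX only records the graph
a = mx.random.normal((1024, 1024))
b = mx.transpose(a)
c = (a @ b).sum()
# Work is performed only when the value is actually needed
mx.eval(c)
print(c.item())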
Requirements¶
- macOS 13.5+ (Ventura or later)
- Apple Silicon (M1, M2, M3, M4 series)
- Python 3.9+
Installation¶
Basic Installation¶
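Install the mlx-lm package from PyPI; it pulls in the core mlx library as a dependency:
pip install mlx-lm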
With Development Tools¶
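For development work, an editable from-source install is typical; this sketch assumes the ml-explore/mlx-lm repository layout:
# Clone and install in editable mode for local development
git clone https://github.com/ml-explore/mlx-lm.git
cd mlx-lm
pip install -e .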
Verify Installation¶
import mlx.core as mx
print(f"MLX version: {mx.__version__}")
print(f"Default device: {mx.default_device()}")
# Should show: Device(gpu, 0)
Quick Start¶
Command Line¶
# Generate text (downloads model if needed)
mlx_lm.generate \
--model mlx-community/Llama-3.3-70B-Instruct-4bit \
--prompt "Explain recursion in programming"
# Chat mode
mlx_lm.chat --model mlx-community/Llama-3.3-70B-Instruct-4bit
Python API¶
from mlx_lm import load, generate
# Load model (cached after first download)
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
# Generate
response = generate(
model,
tokenizer,
prompt="What is Docker?",
max_tokens=200,
temp=0.7
)
print(response)
MLX-Community Models¶
Pre-quantized models optimized for MLX:
| Model | Size | HuggingFace Path |
|---|---|---|
| Llama 3.3 70B 4-bit | ~40GB | mlx-community/Llama-3.3-70B-Instruct-4bit |
| Llama 3.3 70B 8-bit | ~70GB | mlx-community/Llama-3.3-70B-Instruct-8bit |
| Qwen 2.5 72B 4-bit | ~42GB | mlx-community/Qwen2.5-72B-Instruct-4bit |
| DeepSeek Coder V2 4-bit | ~130GB | mlx-community/DeepSeek-Coder-V2-Instruct-4bit |
| Mistral 7B 4-bit | ~4GB | mlx-community/Mistral-7B-Instruct-v0.3-4bit |
Browse all at huggingface.co/mlx-community.
Model Conversion¶
Convert models to MLX format:
From Hugging Face¶
# Convert and quantize
mlx_lm.convert \
--hf-path meta-llama/Llama-3.3-70B-Instruct \
--mlx-path ./llama-3.3-70b-4bit \
-q # Quantize to 4-bit
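The converter can also push the result straight to the Hub via the --upload-repo flag (the target repo below is a placeholder):
# Convert, quantize, and upload in one step
mlx_lm.convert \
--hf-path meta-llama/Llama-3.3-70B-Instruct \
--mlx-path ./llama-3.3-70b-4bit \
-q \
--upload-repo your-username/llama-3.3-70b-4bit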
Quantization Options¶
# 4-bit (smallest)
mlx_lm.convert --hf-path model -q --q-bits 4
# 8-bit (higher quality)
mlx_lm.convert --hf-path model -q --q-bits 8
# Group size for quantization
mlx_lm.convert --hf-path model -q --q-bits 4 --q-group-size 64
From GGUF¶
# Convert a GGUF checkpoint to MLX format (GGUF input support varies by mlx-lm version)
mlx_lm.convert \
--gguf-path /path/to/model.gguf \
--mlx-path ./converted-model
Server Mode¶
Run MLX as an OpenAI-compatible server:
Using mlx_lm.server¶
# The server ships with the base package
pip install mlx-lm
# Start server
mlx_lm.server \
--model mlx-community/Llama-3.3-70B-Instruct-4bit \
--host 0.0.0.0 \
--port 8080
API Usage¶
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
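Because the endpoint is OpenAI-compatible, the official openai Python client works against it too; the base_url and model name below mirror the curl example, and the api_key is a dummy value for servers that don't enforce keys:
from openai import OpenAI
# Point the client at the local MLX server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)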
With LiteLLM Proxy¶
For more robust serving (routing, API keys, logging):
pip install litellm
# Point the proxy at the MLX server's OpenAI-compatible endpoint
litellm --model openai/default --api_base http://localhost:8080/v1
Advanced Generation¶
Streaming¶
from mlx_lm import load, stream_generate
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
for token in stream_generate(
model,
tokenizer,
prompt="Write a haiku about programming:",
max_tokens=50
):
print(token, end="", flush=True)
Custom Parameters¶
from mlx_lm import generate
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=500,
temp=0.7, # Temperature
top_p=0.95, # Nucleus sampling
repetition_penalty=1.1,
repetition_context_size=20
)
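Recent mlx-lm releases move sampling controls out of generate() and into a sampler object; if the keyword arguments above are rejected by your version, this sketch uses mlx_lm.sample_utils.make_sampler (verify the names against your installed release):
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler
# Build a sampler with the same temperature and nucleus settings
sampler = make_sampler(temp=0.7, top_p=0.95)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=500,
    sampler=sampler,  # replaces the temp/top_p keyword arguments
)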
Chat Format¶
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a string."}
]
# Apply chat template
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
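To keep a multi-turn conversation going, append the model's reply and the next user message, then re-apply the template; a short sketch continuing the example above:
# Append the assistant's reply and the next user turn
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Now add type hints to that function."})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)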
Performance Optimization¶
Memory Management¶
import mlx.core as mx
# Clear memory cache
mx.metal.clear_cache()
# Check memory usage
print(f"Peak memory: {mx.metal.get_peak_memory() / 1e9:.2f} GB")
print(f"Active memory: {mx.metal.get_active_memory() / 1e9:.2f} GB")
Batch Generation¶
# Generate multiple responses (mlx-lm processes prompts sequentially)
prompts = ["Question 1:", "Question 2:", "Question 3:"]
for prompt in prompts:
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
mx.metal.clear_cache() # Clear between generations
KV Cache Optimization¶
For long conversations, the KV cache can grow large:
# Limit context for memory efficiency
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=500,
max_kv_size=4096 # Limit KV cache
)
Performance Comparison¶
Benchmarks on an M4 Max (128GB) running Llama 3.3 70B 4-bit (TTFT = time to first token):
| Framework | Tokens/sec | TTFT | Notes |
|---|---|---|---|
| MLX | 45-50 | ~100ms | Metal optimized |
| llama.cpp (Metal) | 35-40 | ~150ms | Good baseline |
| Ollama | 33-38 | ~200ms | Convenience overhead |
MLX advantages:
- 20-40% faster token generation
- Better memory efficiency
- Lower time to first token
Speculative Decoding¶
Use a small model to speed up a large model:
from mlx_lm import load, generate
# Load draft and target models (both must use the same tokenizer)
draft_model, _ = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
target_model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
# Speculative generation (requires an mlx-lm version with draft_model support)
response = generate(
target_model,
tokenizer,
prompt="Explain quantum computing",
draft_model=draft_model,
max_tokens=500
)
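The CLI exposes the same feature; the flag name below assumes an mlx-lm build with speculative decoding:
# Use the small draft model to accelerate the 70B target
mlx_lm.generate \
--model mlx-community/Llama-3.3-70B-Instruct-4bit \
--draft-model mlx-community/Llama-3.2-1B-Instruct-4bit \
--prompt "Explain quantum computing"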
Troubleshooting¶
Out of Memory¶
# Use a smaller quantization (4-bit instead of 8-bit)
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit
# Or cap the KV cache to limit context memory
mlx_lm.generate --model model --max-kv-size 4096
Slow First Load¶
The first load includes Metal kernel compilation and, for new models, the download itself; subsequent runs reuse both and are much faster.
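Downloads are kept in the Hugging Face hub cache, so they happen only once; the path below assumes the default cache location (overridden by HF_HOME, as shown in the next section):
# Inspect previously downloaded models
ls ~/.cache/huggingface/hub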
Model Not Found¶
# Login for gated models
huggingface-cli login
# Set cache directory
export HF_HOME=/tank/ai/models/huggingface
GPU Not Used¶
import mlx.core as mx
# Verify GPU is available
print(mx.default_device()) # Should show: Device(gpu, 0)
# Force GPU
mx.set_default_device(mx.gpu)
See Also¶
- Inference Engines Index - Engine comparison
- Unified Memory - Memory architecture
- Hugging Face - Model downloads
- Benchmarking - Performance testing