# API Serving

Expose local LLMs via OpenAI-compatible APIs for integration with tools and services.

## Overview

OpenAI-compatible APIs enable:

- **Tool compatibility** - Claude Code, Aider, and Continue.dev work seamlessly
- **Standard interface** - a single API works with any backend
- **Flexibility** - switch models without changing client code (see the sketch below)
- **Ecosystem** - libraries, SDKs, and tools work out of the box
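
In practice, flexibility means the client code stays the same while only the endpoint and model name change. A minimal sketch, assuming Ollama on port 11434 and a llama.cpp server on port 8080 (the ports and model names below are illustrative):

```python
from openai import OpenAI

# The same client code targets different backends: only base_url and
# model change. Ports and model names are illustrative examples.
BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3.3:70b"},
    "llama.cpp": {"base_url": "http://localhost:8080/v1", "model": "qwen2.5-coder"},
}

def ask(backend: str, prompt: str) -> str:
    cfg = BACKENDS[backend]
    client = OpenAI(base_url=cfg["base_url"], api_key="not-needed")
    response = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("ollama", "Hello!"))
```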

## API Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                       Client Applications                       │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────────────┐    │
│  │ Claude Code │  │    Aider    │  │      Custom Apps      │    │
│  │  (OpenAI)   │  │  (OpenAI)   │  │     (OpenAI SDK)      │    │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬───────────┘    │
│         │                │                     │                │
│         └────────────────┴─────────────────────┘                │
│                          │                                      │
│             POST /v1/chat/completions                           │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                          API Gateway                            │
│                   (Optional: Traefik/nginx)                     │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                       Inference Backends                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Ollama    │  │  llama.cpp  │  │         LocalAI         │  │
│  │   :11434    │  │    :8080    │  │          :8080          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

## Standard Endpoints

OpenAI-compatible servers expose these endpoints (exact coverage varies by backend):

| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/chat/completions` | POST | Chat/conversation |
| `/v1/completions` | POST | Text completion |
| `/v1/models` | GET | List available models |
| `/v1/embeddings` | POST | Generate embeddings |
| `/health` | GET | Health check |
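
The model-listing and embeddings endpoints work with the same SDK as chat. A minimal sketch, assuming Ollama on the default port; `nomic-embed-text` is an illustrative embedding model name, so substitute one your backend actually serves:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# GET /v1/models - list what the backend is currently serving
for model in client.models.list().data:
    print(model.id)

# POST /v1/embeddings - generate embeddings (model name is illustrative;
# use an embedding model your backend has available)
embedding = client.embeddings.create(
    model="nomic-embed-text",
    input="local inference with an OpenAI-compatible API",
)
print(len(embedding.data[0].embedding))
```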

## Backend Comparison

| Backend | Strengths | Endpoints | Notes |
|---|---|---|---|
| Ollama | Easy setup, model management | All standard | Recommended start |
| llama.cpp | Performance, flexibility | All standard | Production |
| LocalAI | Drop-in replacement, multimodal | Full OpenAI | Feature-rich |
| vLLM | High throughput | All standard | Multi-GPU |

## Quick Start

### Test API

```bash
# List models
curl http://localhost:11434/v1/models

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

### With Streaming

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

## Client Configuration

### OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
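
The same client also supports streaming, mirroring the curl example above. A sketch, assuming the backend emits OpenAI-style chat completion chunks:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# stream=True yields chunks as the backend generates tokens
stream = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```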

### JavaScript/TypeScript

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'llama3.3:70b',
  messages: [{ role: 'user', content: 'Hello!' }]
});
```

### curl

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-needed" \
  -d '{"model": "llama3.3:70b", "messages": [{"role": "user", "content": "Hello"}]}'
```

## Topics

- **OpenAI Compatible** - Standard endpoints and request/response formats
- **LocalAI** - Full OpenAI replacement with multimodal support
- **Load Balancing** - Multiple backends with Traefik routing

## Environment Setup

### For Coding Tools

```bash
# Set environment variables
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed

# Or per-tool configuration
# See coding-tools section for specific tools
```
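
Your own scripts can honor the same variables. A minimal sketch; note that which variable names a given tool actually reads varies, so check each tool's documentation:

```python
import os
from openai import OpenAI

# Pick up the variables exported above, with local defaults as a fallback
client = OpenAI(
    base_url=os.environ.get("OPENAI_API_BASE", "http://localhost:11434/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed"),
)
print([model.id for model in client.models.list().data])
```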

### Docker Network Access

```yaml
# Containers can access by service name
services:
  ollama:
    container_name: ollama
    # ...

  app:
    environment:
      - OPENAI_API_BASE=http://ollama:11434/v1
```
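
From inside the `app` container, the backend is reached by its compose service name rather than `localhost`. A quick connectivity check, sketched with only the standard library:

```python
import json
import os
import urllib.request

# Inside the compose network the hostname is the service name ("ollama"),
# not localhost; the app container gets the URL from OPENAI_API_BASE above.
base = os.environ.get("OPENAI_API_BASE", "http://ollama:11434/v1")
with urllib.request.urlopen(f"{base}/models") as resp:
    print([m["id"] for m in json.load(resp)["data"]])
```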

## Common Configurations

### Single Backend

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - /tank/ai/models/ollama:/root/.ollama
```
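
After `docker compose up -d`, the API may take a few seconds to start accepting requests. A small readiness check, sketched with the standard library; adjust the URL and timeouts to your setup:

```python
import time
import urllib.error
import urllib.request

# Poll the models endpoint until the freshly started backend responds
url = "http://localhost:11434/v1/models"
for _ in range(30):
    try:
        with urllib.request.urlopen(url, timeout=2):
            print("backend is up")
            break
    except (urllib.error.URLError, OSError):
        time.sleep(1)
else:
    raise SystemExit("backend did not come up in time")
```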

### Multi-Backend with Gateway

```yaml
services:
  traefik:
    image: traefik:v3.0
    ports:
      - "8080:80"
    # Routes to different backends

  ollama:
    # General models

  llama-code:
    # Code-specialized model
```

See Load Balancing for details.
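
Clients then target the gateway instead of an individual backend. How a given request is routed to `ollama` or `llama-code` depends on the Traefik rules you define; a sketch, assuming the gateway is published on port 8080 as in the compose file above:

```python
from openai import OpenAI

# Point the client at the gateway; Traefik decides which backend serves
# the request based on your routing rules (see Load Balancing).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello via the gateway!"}],
)
print(response.choices[0].message.content)
```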

## See Also

- Inference Engines - Backend options
- Container Deployment - Docker setup
- AI Coding Tools - Tool configuration
- Remote Access - External access