API Serving

Expose local LLMs via OpenAI-compatible APIs for integration with tools and services.

Overview

OpenAI-compatible APIs enable:

  • Tool compatibility - Claude Code, Aider, Continue.dev work seamlessly
  • Standard interface - Single API works with any backend
  • Flexibility - Switch models without changing client code
  • Ecosystem - Libraries, SDKs, and tools work out of the box

API Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Client Applications                        │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────────────┐   │
│  │ Claude Code │  │   Aider     │  │    Custom Apps        │   │
│  │ (OpenAI)    │  │  (OpenAI)   │  │    (OpenAI SDK)       │   │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬───────────┘   │
│         │                │                     │                │
│         └────────────────┴─────────────────────┘                │
│                          │                                      │
│              POST /v1/chat/completions                          │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                    API Gateway                                  │
│              (Optional: Traefik/nginx)                          │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                   Inference Backends                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │   Ollama    │  │ llama.cpp   │  │       LocalAI           │ │
│  │   :11434    │  │   :8080     │  │       :8080             │ │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Standard Endpoints

Most OpenAI-compatible servers expose these endpoints (exact health-check paths vary by backend):

Endpoint               Method   Purpose
/v1/chat/completions   POST     Chat/conversation
/v1/completions        POST     Text completion
/v1/models             GET      List available models
/v1/embeddings         POST     Generate embeddings
/health                GET      Health check
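
As a quick illustration, the models and embeddings endpoints can be exercised through the standard OpenAI SDK. A minimal sketch, assuming an Ollama backend on the default port and that an embedding-capable model (nomic-embed-text, used here only as an example) has already been pulled:

from openai import OpenAI

# Point the standard SDK at the local backend; the key is required by the SDK
# but typically not validated by local servers.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# GET /v1/models - list what the backend currently serves
for model in client.models.list().data:
    print(model.id)

# POST /v1/embeddings - assumes an embedding model (example name) is available
emb = client.embeddings.create(model="nomic-embed-text", input="local LLMs")
print(len(emb.data[0].embedding))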

Backend Comparison

Backend     Strengths                         Endpoints      Notes
Ollama      Easy setup, model management      All standard   Recommended start
llama.cpp   Performance, flexibility          All standard   Production
LocalAI     Drop-in replacement, multimodal   Full OpenAI    Feature-rich
vLLM        High throughput                   All standard   Multi-GPU

Quick Start

Test API

# List models
curl http://localhost:11434/v1/models

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

With Streaming

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
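
With most OpenAI-compatible backends, the streamed reply arrives as server-sent events: each data: line carries a JSON chunk whose choices[0].delta field holds the next piece of content, and the stream ends with a data: [DONE] line.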

Client Configuration

OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
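
The same request can be streamed through the SDK. A minimal sketch, reusing the local base URL and placeholder key from above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# stream=True yields incremental chunks instead of one final response
stream = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial delta; guard against empty/terminal chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()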

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'llama3.3:70b',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);

curl

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-needed" \
  -d '{"model": "llama3.3:70b", "messages": [{"role": "user", "content": "Hello"}]}'

Topics

  • OpenAI Compatible - Standard endpoints and request/response formats
  • LocalAI - Full OpenAI replacement with multimodal support
  • Load Balancing - Multiple backends with Traefik routing

Environment Setup

For Coding Tools

# Set environment variables
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed

# Or per-tool configuration
# See coding-tools section for specific tools
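
Scripts can read the same variables instead of hardcoding an endpoint. A minimal sketch with the OpenAI Python SDK; the fallback values are assumptions matching the examples above:

import os
from openai import OpenAI

# Use the exported variables if present, otherwise fall back to local defaults
client = OpenAI(
    base_url=os.environ.get("OPENAI_API_BASE", "http://localhost:11434/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed"),
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)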

Docker Network Access

# Containers on the same Docker network reach the backend by its service name
services:
  ollama:
    container_name: ollama
    # ...

  app:
    environment:
      - OPENAI_API_BASE=http://ollama:11434/v1

Common Configurations

Single Backend

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - /tank/ai/models/ollama:/root/.ollama

Multi-Backend with Gateway

services:
  traefik:
    image: traefik:v3.0
    ports:
      - "8080:80"
    # Routes to different backends

  ollama:
    # General models

  llama-code:
    # Code-specialized model

See Load Balancing for details.
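
For client code, only the base URL changes. A sketch assuming the gateway above is published on localhost:8080 and forwards /v1/* requests to a configured backend:

from openai import OpenAI

# Same client code as before; only the base URL now points at the gateway
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Which backend answers is decided by the gateway's routing rules; the model
# name still selects a model within that backend (name here is illustrative)
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Summarize OpenAI-compatible APIs."}],
)
print(response.choices[0].message.content)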

See Also