OpenAI Compatible API

Standard API endpoints and formats for local LLM inference.

Why OpenAI Compatibility?

  • Ecosystem support - Thousands of tools and libraries work automatically
  • No code changes - Switch between providers by changing the base URL (see the example after this list)
  • Familiar interface - Well-documented, widely understood API
  • Future-proof - Industry standard for LLM APIs
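
For example, the official OpenAI Python SDK needs only a different base_url to talk to a local server. A minimal sketch, assuming an Ollama-style backend on localhost:11434 (adjust the URL and key handling for your server):

from openai import OpenAI

# Existing OpenAI client code keeps working; only the base URL
# (plus a placeholder API key) changes.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed local endpoint (Ollama default port)
    api_key="not-needed",                  # most local servers ignore the key
)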

Endpoint Reference

Chat Completions

Primary endpoint for conversational AI:

POST /v1/chat/completions

Request

{
  "model": "llama3.3:70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 500,
  "stream": false
}

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706789012,
  "model": "llama3.3:70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}
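
From Python, the same request can be made with the OpenAI SDK. A sketch assuming a local server on localhost:11434 with llama3.3:70b already available:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

# Mirrors the JSON request above: system + user message, non-streaming.
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
print(response.usage.total_tokens)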

Completions (Legacy)

Text completion without chat format:

POST /v1/completions

Request

{
  "model": "llama3.3:70b",
  "prompt": "The capital of France is",
  "max_tokens": 20,
  "temperature": 0.3
}
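
The equivalent call through the OpenAI SDK uses the legacy completions client; a sketch under the same local-server assumptions as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

# Legacy completions: raw prompt in, continuation text out.
response = client.completions.create(
    model="llama3.3:70b",
    prompt="The capital of France is",
    max_tokens=20,
    temperature=0.3,
)
print(response.choices[0].text)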

List Models

GET /v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "llama3.3:70b",
      "object": "model",
      "created": 1706789012,
      "owned_by": "library"
    }
  ]
}
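
From Python, listing models is a one-liner through the SDK (same local-server assumption):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

# Lists whatever models the local backend currently serves.
for model in client.models.list().data:
    print(model.id)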

Embeddings

POST /v1/embeddings

Request

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox"
}

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.123, -0.456, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
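
Through the SDK, a sketch assuming nomic-embed-text is already pulled on the local server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

response = client.embeddings.create(
    model="nomic-embed-text",
    input="The quick brown fox",
)
vector = response.data[0].embedding
print(len(vector))  # embedding dimension (768 for nomic-embed-text)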

Request Parameters

Common Parameters

Parameter     Type     Default    Description
model         string   Required   Model identifier
temperature   float    0.7        Randomness (0-2)
max_tokens    int      varies     Maximum response length
top_p         float    1.0        Nucleus sampling
stream        bool     false      Enable streaming
stop          array    null       Stop sequences

Chat-Specific Parameters

Parameter          Type   Description
messages           array  Conversation history
presence_penalty   float  Penalizes tokens that have already appeared (reduces repetition)
frequency_penalty  float  Penalizes tokens in proportion to how often they have appeared (reduces repetition)
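
A request that combines several of these parameters might look as follows; the values are illustrative, and local backends may silently ignore parameters they do not support:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "List three colors."}],
    temperature=0.2,        # low randomness
    max_tokens=100,         # cap response length
    top_p=0.9,              # nucleus sampling
    stop=["\n\n"],          # stop at the first blank line
    presence_penalty=0.5,   # discourage reusing tokens that already appeared
    frequency_penalty=0.5,  # discourage frequent repetition
)
print(response.choices[0].message.content)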

Message Format

{
  "messages": [
    {"role": "system", "content": "System prompt..."},
    {"role": "user", "content": "User message..."},
    {"role": "assistant", "content": "Previous response..."},
    {"role": "user", "content": "Follow-up..."}
  ]
}

Roles:

  • system - Instructions for the model
  • user - Human input
  • assistant - Model responses (for context)
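
To keep a multi-turn conversation coherent, append each assistant reply to the history before sending the follow-up; a sketch reusing the local client from earlier examples:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name a prime number."},
]
first = client.chat.completions.create(model="llama3.3:70b", messages=messages)

# Keep the assistant's reply in the history so the follow-up has context.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now name a larger one."})
second = client.chat.completions.create(model="llama3.3:70b", messages=messages)
print(second.choices[0].message.content)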

Streaming

Enable Streaming

{
  "model": "llama3.3:70b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true
}

Response Format (SSE)

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop","index":0}]}

data: [DONE]

Python Streaming Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

stream = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Backend-Specific Extensions

Ollama Extensions

Ollama's native API at /api/*:

# Pull model
curl -X POST http://localhost:11434/api/pull \
  -d '{"name": "llama3.3:70b"}'

# Show model info
curl -X POST http://localhost:11434/api/show \
  -d '{"name": "llama3.3:70b"}'

# Running models
curl http://localhost:11434/api/ps

llama.cpp Extensions

Additional endpoints:

# Health check
curl http://localhost:8080/health

# Server props
curl http://localhost:8080/props

# Tokenize
curl -X POST http://localhost:8080/tokenize \
  -d '{"content": "Hello world"}'

Error Handling

Error Response Format

{
  "error": {
    "message": "Model not found",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}

Common Errors

Code  Description      Solution
400   Bad request      Check request format
401   Unauthorized     Add an API key (even if ignored)
404   Model not found  Pull the model first
503   Model loading    Wait and retry

Retry Logic

from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

# Retry with exponential backoff (1-10 s between attempts), giving up after 5 tries.
@retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(5))
def chat(messages):
    return client.chat.completions.create(
        model="llama3.3:70b",
        messages=messages
    )

Rate Limiting

Local servers typically don't enforce rate limits, but consider limiting concurrency on the server (snippets below) and in your client (see the sketch after them):

# Ollama: Control concurrent requests
environment:
  - OLLAMA_NUM_PARALLEL=4  # Max concurrent

# llama.cpp: Control slots
command: --parallel 4  # Max concurrent
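
On the client side, capping in-flight requests avoids overloading the server. A minimal sketch using a semaphore sized to match the server's parallel setting (the limits and prompts here are illustrative):

from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")
slots = Semaphore(4)  # match OLLAMA_NUM_PARALLEL / --parallel

def ask(prompt):
    with slots:  # at most 4 requests in flight at once
        response = client.chat.completions.create(
            model="llama3.3:70b",
            messages=[{"role": "user", "content": prompt}],
        )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(ask, ["Hi", "Hello", "Hey", "Howdy"])))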

Best Practices

Model Naming

Use consistent model names:

# Good - matches Ollama naming
model="llama3.3:70b-instruct-q4_K_M"

# Works - Ollama resolves a bare name to the :latest tag
model="llama3.3"

Context Management

Keep context within limits:

def truncate_messages(messages, max_tokens=4000):
    """Keep the most recent messages within a rough ~4-chars-per-token budget."""
    kept, budget = [], max_tokens * 4
    for msg in reversed(messages):
        budget -= len(msg.get("content", ""))
        if budget < 0 and kept:  # older messages no longer fit
            break
        kept.insert(0, msg)
    return kept

Error Handling

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")

try:
    response = client.chat.completions.create(...)
except openai.APIConnectionError:
    print("Server not reachable")
except openai.APIStatusError as e:
    print(f"API error: {e.status_code}")

Testing Compatibility

Basic Test Script

#!/usr/bin/env python3
"""Test OpenAI API compatibility."""

from openai import OpenAI

BASE_URL = "http://localhost:11434/v1"

client = OpenAI(base_url=BASE_URL, api_key="test")

# Test: List models
print("Models:", [m.id for m in client.models.list().data])

# Test: Chat completion
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=10
)
print("Response:", response.choices[0].message.content)

# Test: Streaming
print("Stream: ", end="")
stream = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Count to 3"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

See Also