# Ollama
Docker-like simplicity for running local LLMs with built-in model management.
## Overview

Ollama provides:

- Simple CLI - `ollama run llama3.3` to get started
- Model library - built-in discovery and downloads
- OpenAI-compatible API - drop-in replacement for existing clients
- Modelfiles - customize models like Dockerfiles
- Cross-platform - macOS, Linux, Windows
## Installation

### macOS

```bash
# Homebrew
brew install ollama

# Or download from ollama.com
curl -fsSL https://ollama.com/download/mac -o ollama.pkg
open ollama.pkg
```
### Linux

```bash
# Install script
curl -fsSL https://ollama.com/install.sh | sh

# The service starts automatically
systemctl status ollama
```
### Manual Service Setup

```bash
# Create a dedicated service user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama

# Write the service file
sudo tee /etc/systemd/system/ollama.service <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/tank/ai/models/ollama"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
```
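Once the service is up, a quick sanity check is to hit the HTTP API and list the installed models. A minimal sketch in Python, assuming the server listens on the default port (adjust the URL if you changed `OLLAMA_HOST`):

```python
# Sketch: verify the Ollama API is reachable and list installed models.
# Assumes the default listen address of localhost:11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=5) as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size_gb = model["size"] / 1e9
    print(f"{model['name']:<40} {size_gb:6.1f} GB")
```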
## Basic Usage

### Running Models

```bash
# Start a chat with a model (downloads it first if needed)
ollama run llama3.3

# Specific version/quantization
ollama run llama3.3:70b-instruct-q4_K_M
```

To end a session, type `/bye` at the chat prompt.
### Model Management

```bash
# List installed models
ollama list

# Pull a model
ollama pull deepseek-coder-v2:16b

# Show model info
ollama show llama3.3

# Remove a model
ollama rm llama3.3:latest

# Copy/rename a model
ollama cp llama3.3 my-llama
```
### Check Status

```bash
# Running models
ollama ps

# Output:
# NAME          ID            SIZE   PROCESSOR  UNTIL
# llama3.3:70b  abc123def456  43 GB  GPU        4 minutes from now
```
## Popular Models

| Model | Size | Use Case | Command |
|---|---|---|---|
| Llama 3.3 70B | ~43GB Q4 | General, coding | `ollama run llama3.3:70b` |
| Qwen 2.5 72B | ~45GB Q4 | Multilingual | `ollama run qwen2.5:72b` |
| DeepSeek Coder V2 16B | ~9GB | Coding | `ollama run deepseek-coder-v2` |
| Mistral Large 2 | ~75GB Q4 | General | `ollama run mistral-large` |
| Llama 3.2 3B | ~2GB | Fast, mobile | `ollama run llama3.2` |

On a 128GB system, 70B models at Q4 quantization fit comfortably, with headroom left for context and the OS.
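The ~43GB figure follows from the quantization arithmetic: Q4_K_M averages roughly 4.8 bits per weight. A back-of-the-envelope sketch (the bits-per-weight values are approximations, not exact specs):

```python
# Rough on-disk/in-memory size estimate for a quantized checkpoint.
# bits_per_weight is approximate; Q4_K_M averages ~4.8 bits in practice.
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(f"70B @ Q4_K_M: ~{quantized_size_gb(70):.0f} GB")       # ~42 GB
print(f"70B @ Q8_0:   ~{quantized_size_gb(70, 8.5):.0f} GB")  # ~74 GB
```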
## API Usage

### Chat Completion (OpenAI-compatible)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
  }'
```
### Native API

```bash
# Generate (completion)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "prompt": "Why is the sky blue?"}'

# Chat
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming (the default; pass "stream": false for a single JSON response)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "prompt": "Tell me a story", "stream": true}'
```
### Embeddings

```bash
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Hello world"}'
```
## Environment Variables

Configure via environment:

| Variable | Description | Default |
|---|---|---|
| `OLLAMA_HOST` | Listen address | `127.0.0.1:11434` |
| `OLLAMA_MODELS` | Model storage path | `~/.ollama/models` |
| `OLLAMA_NUM_PARALLEL` | Concurrent requests per model | `1` |
| `OLLAMA_MAX_LOADED_MODELS` | Models kept in memory | `1` |
| `OLLAMA_KEEP_ALIVE` | Idle time before a model is unloaded | `5m` |
| `OLLAMA_DEBUG` | Debug logging | `false` |
Example configuration:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (create it with: sudo systemctl edit ollama)
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/tank/ai/models/ollama"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=30m"
```
## Modelfiles

Customize models with Dockerfile-like syntax:

### Basic Modelfile

```
# Modelfile
FROM llama3.3:70b

# Set system prompt
SYSTEM """
You are a senior software engineer. Write clean, efficient code.
Focus on Python and TypeScript.
"""

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
```
### Build Custom Model

```bash
# Create a model from the Modelfile
ollama create coding-assistant -f Modelfile

# Run it
ollama run coding-assistant
```
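Model creation can also be scripted. A sketch that writes a Modelfile and shells out to the same CLI command (the model name and system prompt are placeholders):

```python
# Sketch: create a custom model by generating a Modelfile and calling the CLI.
import subprocess
import tempfile
from pathlib import Path

MODELFILE = """\
FROM llama3.3:70b
SYSTEM You are a senior software engineer. Write clean, efficient code.
PARAMETER temperature 0.3
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Modelfile"
    path.write_text(MODELFILE)
    subprocess.run(
        ["ollama", "create", "coding-assistant", "-f", str(path)],
        check=True,
    )
```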
### Import GGUF

```
# Modelfile for GGUF import
FROM /tank/ai/models/gguf/custom-model.gguf

TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}{{ if .Prompt }}User: {{ .Prompt }}
{{ end }}Assistant: """

PARAMETER stop "User:"
```
### Modelfile Parameters

| Parameter | Description | Example |
|---|---|---|
| `FROM` | Base model or path | `llama3.3:70b` |
| `SYSTEM` | System prompt | `"You are..."` |
| `TEMPLATE` | Prompt template | Custom format |
| `PARAMETER temperature` | Randomness | `0.7` |
| `PARAMETER num_ctx` | Context length | `8192` |
| `PARAMETER num_gpu` | GPU layers | `99` |
| `PARAMETER stop` | Stop sequence (repeat the line for multiple) | `"User:"` |
## Multi-Model Setup

### Keep Multiple Models Loaded

```bash
# Allow 2 models in memory (set in the server's environment,
# e.g. before `ollama serve` or in the systemd override)
export OLLAMA_MAX_LOADED_MODELS=2

# Load a model
ollama run llama3.3

# In another terminal
ollama run deepseek-coder-v2
```
### Model Switching

```bash
# Check loaded models
ollama ps

# Unload a model immediately
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "keep_alive": 0}'
```
## Integration Examples

### Python

```python
# pip install ollama
import ollama

response = ollama.chat(
    model='llama3.3',
    messages=[
        {'role': 'user', 'content': 'Explain Kubernetes briefly'}
    ]
)
print(response['message']['content'])
```
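The same client supports token streaming by passing `stream=True`, which turns the call into an iterator of chunks:

```python
# Streaming variant: print response chunks as they arrive.
import ollama

stream = ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Explain Kubernetes briefly'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()
```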
### OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
```
### JavaScript/TypeScript

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',  // Required by the SDK but ignored by Ollama
});

const response = await client.chat.completions.create({
  model: 'llama3.3',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);
```
## Performance

### GPU Memory Usage

```bash
# Check GPU allocation
ollama ps

# For 70B Q4 on a 128GB Mac:
# - ~43GB for model weights
# - ~2GB for the 8K-context KV cache
# - Leaves ~80GB for the system and other models
```
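The KV-cache figure can be sanity-checked from the architecture: cache size is 2 (keys and values) × layers × KV heads × head dim × context length × bytes per element. A sketch using Llama 3.3 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128):

```python
# Sketch: estimate KV-cache memory for a GQA transformer.
# Shape arguments below are Llama 3.3 70B's published architecture.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(f"8K ctx, fp16 cache: ~{kv_cache_gb(80, 8, 128, 8192):.1f} GB")     # ~2.7 GB
print(f"8K ctx, q8 cache:   ~{kv_cache_gb(80, 8, 128, 8192, 1):.1f} GB")  # ~1.3 GB
```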
### Speed Comparison
On M4 Max (128GB), Llama 3.3 70B Q4:
| Metric | Value |
|---|---|
| Tokens/sec | ~35 |
| Time to first token | ~200ms |
| Context processing | ~500 tok/sec |
## Troubleshooting

### Model Download Fails

```bash
# Check disk space
df -h ~/.ollama

# Resume an interrupted download (pull picks up where it left off)
ollama pull llama3.3:70b

# Clear a corrupted download, then pull again
rm -rf ~/.ollama/models/blobs/sha256-<partial>
ollama pull llama3.3:70b
```
### Out of Memory

```bash
# Use a smaller quantization
ollama run llama3.3:70b-instruct-q4_K_S  # instead of q4_K_M

# Reduce the context window
curl -d '{"model": "llama3.3", "options": {"num_ctx": 4096}}' \
  http://localhost:11434/api/generate
```
### Slow Startup

```bash
# Pre-load the model and keep it resident for 24 hours
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "keep_alive": "24h"}'
```
### API Connection Refused

```bash
# Bind to all interfaces
export OLLAMA_HOST=0.0.0.0
ollama serve

# Or set it in the service file:
# Environment="OLLAMA_HOST=0.0.0.0"
```
## See Also

- Inference Engines Index - Engine comparison
- Ollama Docker - Container deployment
- OpenAI Compatible - API details
- Model Volumes - ZFS storage for models