Ollama

Docker-like simplicity for running local LLMs with built-in model management.

Overview

Ollama provides:

  • Simple CLI - ollama run llama3.3 to get started
  • Model library - Built-in discovery and downloads
  • OpenAI-compatible API - Drop-in replacement
  • Modelfiles - Customize models like Dockerfiles
  • Cross-platform - macOS, Linux, Windows

Installation

macOS

# Homebrew
brew install ollama

# Or download from ollama.com
curl -fsSL https://ollama.com/download/mac -o ollama.pkg
open ollama.pkg

Linux

# Install script
curl -fsSL https://ollama.com/install.sh | sh

# Starts service automatically
systemctl status ollama

Manual Service Setup

# Create service user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama

# Service file
sudo tee /etc/systemd/system/ollama.service <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/tank/ai/models/ollama"

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama

Basic Usage

Running Models

# Start chat with a model (downloads if needed)
ollama run llama3.3

# Specific version/quantization
ollama run llama3.3:70b-instruct-q4_K_M

# Exit chat
/bye

Model Management

# List installed models
ollama list

# Pull a model
ollama pull deepseek-coder-v2:16b

# Show model info
ollama show llama3.3

# Remove a model
ollama rm llama3.3:latest

# Copy/rename model
ollama cp llama3.3 my-llama

Check Status

# Running models
ollama ps

# Output:
# NAME              ID              SIZE     PROCESSOR  UNTIL
# llama3.3:70b      abc123def456    43 GB    GPU        4 minutes from now

Recommended Models

Model               Size       Use Case          Command
Llama 3.3 70B       ~43GB Q4   General, coding   ollama run llama3.3:70b
Qwen 2.5 72B        ~45GB Q4   Multilingual      ollama run qwen2.5:72b
DeepSeek Coder V2   ~9GB       Coding            ollama run deepseek-coder-v2
Mistral Large 2     ~75GB Q4   General           ollama run mistral-large
Llama 3.2 3B        ~2GB       Fast, mobile      ollama run llama3.2

For 128GB systems, 70B models at Q4 quantization work well.
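
As a rough check on these sizes: Q4_K_M quantization averages a little under 5 bits per weight, so a 70B model lands in the low 40s of gigabytes. A back-of-envelope sketch in Python (the bits-per-weight value is an approximation, not an exact Ollama figure):

# Rough estimate of a Q4_K_M 70B download size
# (approximation only; real GGUF files add metadata and keep some tensors at higher precision)
params = 70.6e9            # Llama 3.3 70B parameter count (approx.)
bits_per_weight = 4.8      # effective bits/weight for Q4_K_M (approx.)
print(f"{params * bits_per_weight / 8 / 1e9:.0f} GB")   # ~42 GB, close to the 43GB above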

API Usage

Chat Completion (OpenAI-compatible)

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
  }'

Native API

# Generate (completion)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "prompt": "Why is the sky blue?"}'

# Chat
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming (default)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "prompt": "Tell me a story", "stream": true}'

Embeddings

curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Hello world"}'

Environment Variables

Configure via environment:

Variable                   Description            Default
OLLAMA_HOST                Listen address         127.0.0.1:11434
OLLAMA_MODELS              Model storage path     ~/.ollama/models
OLLAMA_NUM_PARALLEL        Concurrent requests    1
OLLAMA_MAX_LOADED_MODELS   Models in memory       1
OLLAMA_KEEP_ALIVE          Model unload timeout   5m
OLLAMA_DEBUG               Debug logging          false

Example configuration:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/tank/ai/models/ollama"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=30m"

# Apply the changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
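
Environment variables set server-wide defaults; some behavior can also be overridden per request in the native API. A sketch of a single request that overrides the keep-alive and context length (values are illustrative):

# Per-request overrides of keep_alive and num_ctx on /api/generate
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3",
        "prompt": "Summarize systemd in one sentence.",
        "stream": False,
        "keep_alive": "1h",                # overrides OLLAMA_KEEP_ALIVE for this load
        "options": {"num_ctx": 4096},      # per-request context length
    },
)
print(resp.json()["response"])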

Modelfiles

Customize models with Dockerfile-like syntax:

Basic Modelfile

# Modelfile
FROM llama3.3:70b

# Set system prompt
SYSTEM """
You are a senior software engineer. Write clean, efficient code.
Focus on Python and TypeScript.
"""

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

Build Custom Model

# Create model from Modelfile
ollama create coding-assistant -f Modelfile

# Run it
ollama run coding-assistant

Import GGUF

# Modelfile for GGUF import
FROM /tank/ai/models/gguf/custom-model.gguf

TEMPLATE """{{ if .System }}{{ .System }}

{{ end }}{{ if .Prompt }}User: {{ .Prompt }}
{{ end }}Assistant: """

PARAMETER stop "User:"

Modelfile Parameters

Parameter               Description          Example
FROM                    Base model or path   llama3.3:70b
SYSTEM                  System prompt        "You are..."
TEMPLATE                Prompt template      Custom format
PARAMETER temperature   Randomness           0.7
PARAMETER num_ctx       Context length       8192
PARAMETER num_gpu       GPU layers           99
PARAMETER stop          Stop sequences       "User:"
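
To check which parameters and system prompt a built model actually carries, the ollama show command has a native API counterpart at /api/show. A sketch that inspects the coding-assistant model created above (recent Ollama versions accept a model field in the request body):

# Inspect a model's parameters and effective Modelfile via the native API
import requests

r = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "coding-assistant"},
)
r.raise_for_status()
info = r.json()
print(info.get("parameters", ""))   # temperature, num_ctx, stop sequences, ...
print(info.get("modelfile", ""))    # full effective Modelfile, including SYSTEM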

Multi-Model Setup

Keep Multiple Models Loaded

# Allow 2 models in memory (must be set in the environment of the
# `ollama serve` process, e.g. via the systemd override shown earlier)
export OLLAMA_MAX_LOADED_MODELS=2

# Load models
ollama run llama3.3
# In another terminal
ollama run deepseek-coder-v2
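
The same setup can be checked programmatically: hit both models, then list what is resident via /api/ps. A sketch assuming OLLAMA_MAX_LOADED_MODELS=2 and that both models are already pulled:

# Exercise two models and confirm both stay loaded
import requests

BASE = "http://localhost:11434"

for model in ("llama3.3", "deepseek-coder-v2"):
    requests.post(
        f"{BASE}/api/generate",
        json={"model": model, "prompt": "ping", "stream": False},
    ).raise_for_status()

loaded = requests.get(f"{BASE}/api/ps").json()["models"]
print([m["name"] for m in loaded])   # expect both models listed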

Model Switching

# Check loaded models
ollama ps

# Unload a model (an empty generate request with keep_alive set to 0 evicts it)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "keep_alive": 0}'

Integration Examples

Python

import ollama

response = ollama.chat(
    model='llama3.3',
    messages=[
        {'role': 'user', 'content': 'Explain Kubernetes briefly'}
    ]
)
print(response['message']['content'])

OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but ignored
)

response = client.chat.completions.create(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama'
});

const response = await client.chat.completions.create({
  model: 'llama3.3',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);

Performance

GPU Memory Usage

# Check GPU allocation
ollama ps

# For 70B Q4 on 128GB Mac:
# - ~43GB for model weights
# - ~2GB for 8K context KV cache
# - Leaves ~80GB for system + other models
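
The KV-cache figure can be estimated from the model architecture. A back-of-envelope sketch (80 layers, 8 KV heads, and head dimension 128 are the published Llama 70B values; the cache is assumed to be fp16 here, so a quantized q8_0 KV cache would be about half):

# Rough KV-cache size for Llama 3.3 70B at 8K context
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K+V, 2 bytes each (fp16)
ctx = 8192
print(f"{bytes_per_token * ctx / 1e9:.1f} GB")           # ~2.7 GB at 8K context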

Speed Comparison

On M4 Max (128GB), Llama 3.3 70B Q4:

Metric                  Value
Tokens/sec              ~35
Time to first token     ~200ms
Context processing      ~500 tok/sec

Troubleshooting

Model Download Fails

# Check disk space
df -h ~/.ollama

# Resume interrupted download
ollama pull llama3.3:70b

# Clear corrupted download
rm -rf ~/.ollama/models/blobs/sha256-<partial>
ollama pull llama3.3:70b

Out of Memory

# Use smaller quantization
ollama run llama3.3:70b-instruct-q4_K_S  # Instead of Q4_K_M

# Reduce context
curl -d '{"model": "llama3.3", "options": {"num_ctx": 4096}}' \
  http://localhost:11434/api/generate

Slow Startup

# Pre-load the model and keep it resident (an empty request with a long keep_alive)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3", "keep_alive": "24h"}'

API Connection Refused

# Bind to all interfaces
export OLLAMA_HOST=0.0.0.0
ollama serve

# Or in service file
Environment="OLLAMA_HOST=0.0.0.0"

See Also