Context Optimization

Balance context length, memory usage, and performance.

Understanding Context

Context length determines:

  • How much conversation history the model "remembers"
  • How much code/text can be analyzed at once
  • Memory requirements for KV cache
  • Time to first token (longer context = slower)

Context vs Memory

KV Cache Growth

KV Cache Size ≈ 2 (K and V) × num_layers × context_length × hidden_size × bytes_per_param

(The figures below assume full multi-head attention; models using grouped-query attention store fewer KV heads and need proportionally less.)

Example (70B-class model, FP16 KV cache):
- 2 (K+V) × 80 layers × 8192 context × 8192 hidden × 2 bytes
- ≈ 21 GB for 8K context

Scaling:
- 4K context:  ~10 GB KV cache
- 8K context:  ~21 GB KV cache
- 16K context: ~42 GB KV cache
- 32K context: ~84 GB KV cache
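
As a sanity check, the arithmetic above is easy to reproduce. A minimal sketch — the parameter names simply mirror the formula's terms:

def kv_cache_gb(num_layers, context_length, hidden_size, bytes_per_param=2):
    """Approximate KV cache size in GB; the leading factor 2 covers K and V."""
    return 2 * num_layers * context_length * hidden_size * bytes_per_param / 1e9

print(kv_cache_gb(80, 8192, 8192))   # ~21.5, the "~21 GB for 8K context" above
print(kv_cache_gb(80, 32768, 8192))  # ~85.9, matching the ~84 GB row (rounding)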

Memory Budget

For a 128GB system with a 70B Q4 model:

Component           Memory    Notes
-----------------   -------   --------------
Model weights       43 GB     Q4_K_M
System reserve      16 GB     OS, apps
Available for KV    ~69 GB    Context budget
Max context         ~32K      With headroom
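
Inverting the formula gives a rough ceiling for a given budget. A sketch under the same full-attention assumptions — real limits shift with flash attention and runtime overhead:

def max_context(budget_gb, num_layers=80, hidden_size=8192, bytes_per_param=2):
    """Largest context whose KV cache fits in budget_gb."""
    return int(budget_gb * 1e9 // (2 * num_layers * hidden_size * bytes_per_param))

print(max_context(69))  # ~26K without flash attention; 32K becomes feasible with it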

Context Length Settings

llama.cpp

# Set context length
./llama-server -m model.gguf -c 8192

# Long context with flash attention
./llama-server -m model.gguf -c 32768 --flash-attn

Ollama

# Via API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Hello", "options": {"num_ctx": 8192}}'

# In Modelfile
PARAMETER num_ctx 16384

Per-Request

{
  "model": "llama3.3",
  "messages": [...],
  "options": {
    "num_ctx": 16384
  }
}
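
The same override works from code. A minimal sketch against Ollama's /api/chat endpoint, assuming the default localhost port and the requests library:

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.3",
        "messages": [{"role": "user", "content": "Hello"}],
        "options": {"num_ctx": 16384},  # per-request context override
        "stream": False,
    },
)
print(resp.json()["message"]["content"])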

Flash Attention

Reduces memory usage for long contexts:

Enable in llama.cpp

# Flash attention is a runtime flag in recent builds; no special
# compile step is needed (check ./llama-server --help for your version)
./llama-server -m model.gguf -c 32768 --flash-attn

Benefits

Context   Without Flash   With Flash   Savings
-------   -------------   ----------   -------
8K        21 GB           15 GB        29%
16K       42 GB           25 GB        40%
32K       84 GB           45 GB        46%

RoPE Scaling

Extend context beyond trained length:

Scaling Methods

Method      Description                         Quality
---------   ---------------------------------   -------------
Linear      Uniformly compresses positions      Good for 2-4x
NTK-aware   Scales the RoPE frequency base      Good for 4-8x
YaRN        Interpolation + attention scaling   Best for 8x+

Configuration

# llama.cpp with RoPE scaling
./llama-server \
  -m model.gguf \
  -c 32768 \
  --rope-freq-base 1000000 \
  --rope-freq-scale 0.5
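
For linear scaling, --rope-freq-scale is just the ratio of the model's trained context to the target context. A sketch of that arithmetic (trained length varies by model; NTK-aware and YaRN use different mappings):

def linear_rope_scale(trained_ctx, target_ctx):
    """--rope-freq-scale value for linear position interpolation."""
    return trained_ctx / target_ctx

print(linear_rope_scale(16384, 32768))  # 0.5, matching the flag above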

Caveats

  • Quality degrades beyond trained context
  • Test carefully for your use case
  • Some models have native long context (128K)

Optimizing for Use Cases

Chat/Conversation

  • 4-8K context usually sufficient
  • Lower context = faster responses
  • Implement conversation summarization for long sessions, for example:

def optimize_conversation(messages, max_tokens=4000):
    """Keep recent messages within context limit."""
    # Estimate tokens (rough)
    total = sum(len(m['content']) // 4 for m in messages)

    while total > max_tokens and len(messages) > 2:
        messages.pop(1)  # Remove oldest (keep system)
        total = sum(len(m['content']) // 4 for m in messages)

    return messages

Code Analysis

  • Larger context for full file analysis
  • 16-32K for multi-file context
  • Consider chunking large files, for example:

def chunk_code(code, chunk_size=8000, overlap=500):
    """Split code into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(code):
        end = start + chunk_size
        chunks.append(code[start:end])
        start = end - overlap
    return chunks

RAG Applications

  • Smaller context per query
  • Rely on retrieval instead of long context (see the sketch below)
  • 4-8K usually sufficient
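
A minimal sketch of that budgeting, assuming a hypothetical retrieve(query) that yields chunks ranked by relevance:

def build_rag_prompt(question, retrieve, budget_tokens=4000):
    """Pack top-ranked chunks into a fixed token budget (rough 4 chars/token)."""
    used, context = len(question) // 4, []
    for chunk in retrieve(question):  # hypothetical retriever
        cost = len(chunk) // 4
        if used + cost > budget_tokens:
            break
        context.append(chunk)
        used += cost
    return "\n\n".join(context) + f"\n\nQuestion: {question}"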

Context Window Strategies

Sliding Window

Keep only recent context:

def sliding_window(messages, max_messages=20):
    """Keep only recent messages."""
    if len(messages) <= max_messages:
        return messages
    # Keep system prompt + recent
    return [messages[0]] + messages[-(max_messages-1):]

Summarization

Summarize old context:

def summarize_old_context(messages, summarizer):
    """Summarize old messages, keep recent."""
    if len(messages) < 20:
        return messages

    old = messages[1:-10]  # Old messages (skip system)
    recent = messages[-10:]  # Recent messages

    summary = summarizer(old)
    summary_msg = {"role": "system", "content": f"Previous context summary: {summary}"}
    return [messages[0], summary_msg] + recent

Smart Truncation

Truncate intelligently:

def smart_truncate(messages, max_tokens=8000):
    """Truncate while preserving important messages."""
    est = lambda m: len(m['content']) // 4  # rough token estimate
    # Always keep the system prompt and the last 4 messages; score the
    # middle by length (a simple importance proxy) and drop lowest first.
    while sum(map(est, messages)) > max_tokens and len(messages) > 5:
        middle = range(1, len(messages) - 4)
        messages.pop(min(middle, key=lambda i: est(messages[i])))
    return messages

Monitoring Context Usage

Check Current Usage

# llama.cpp metrics (start llama-server with --metrics to enable the endpoint)
curl http://localhost:8080/metrics | grep context

# Ollama
ollama ps  # Shows loaded models and their memory footprint

Track Over Time

import logging

def track_context(messages):
    """Log approximate context size over time."""
    # ~1.3 tokens per whitespace-delimited word is a rough English average
    tokens = sum(len(m['content'].split()) * 1.3 for m in messages)
    logging.info(f"Context size: ~{int(tokens)} tokens")

Performance Impact

Context Length vs Speed

Context   Prompt Eval   Generation   TTFT
-------   -----------   ----------   ------
2K        Fast          Fast         <100ms
8K        Good          Good         <200ms
16K       Slower        Good         <400ms
32K       Much slower   Good         <800ms

Recommendation

Default: 8192 (good balance)
Chat: 4096-8192
Code: 16384-32768
RAG: 4096-8192
Long docs: 32768+ (with flash attention)

Troubleshooting

Out of Memory

# Reduce context
-c 4096  # Instead of 8192

# Enable flash attention
--flash-attn

# Use smaller quantization
model-q4_k_s.gguf  # Instead of q4_k_m

Slow First Token

  • Reduce context length
  • Use flash attention
  • Pre-compute the KV cache for static prompts (example below)
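
One way to do the pre-computation in llama.cpp is llama-cli's prompt cache (flag names may vary between builds; check --help for your version):

# First run saves the KV state of the static prompt
./llama-cli -m model.gguf -f static-prompt.txt --prompt-cache prompt.bin

# Later runs with the same prompt prefix reload it, cutting time to first token
./llama-cli -m model.gguf -f static-prompt.txt --prompt-cache prompt.bin -n 256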

Context Overflow Errors

  • Implement token counting (sketch below)
  • Truncate messages before sending
  • Use the model's actual context limit
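
A minimal pre-flight check using the same rough 4-characters-per-token estimate (a model-specific tokenizer is more accurate):

def fits_context(messages, num_ctx=8192, reserve=512):
    """True if the estimated prompt plus a generation reserve fits num_ctx."""
    est = sum(len(m["content"]) // 4 for m in messages)
    return est + reserve <= num_ctx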
