# Choosing Models
Select the right model for your use case, hardware, and quality requirements.
## Decision Framework
```
┌─────────────────────────────────────────────────────────────┐
│                What's your primary use case?                │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   ┌─────────┐           ┌─────────┐           ┌─────────┐
   │  Code   │           │  Chat   │           │  Docs   │
   │ Assist  │           │  /QA    │           │ Process │
   └────┬────┘           └────┬────┘           └────┬────┘
        │                     │                     │
        ▼                     ▼                     ▼
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│ DeepSeek Coder │    │ Llama 3.3 70B  │    │ Qwen 2.5 72B   │
│ Qwen 2.5 Coder │    │ Qwen 2.5 72B   │    │ (Long context) │
└────────────────┘    └────────────────┘    └────────────────┘
```
## Models by Memory Tier

### 32GB RAM/VRAM

| Model | Quant | Size | Use Case |
|-------|-------|------|----------|
| Llama 3.1 8B | Q8_0 | ~10GB | Fast general |
| Qwen 2.5 7B | Q8_0 | ~9GB | Multilingual |
| DeepSeek Coder V2 Lite | Q8_0 | ~9GB | Coding |
| Mistral 7B | Q8_0 | ~9GB | Efficient general |
| Gemma 2 9B | Q8_0 | ~11GB | Balanced |
### 64GB RAM/VRAM

| Model | Quant | Size | Use Case |
|-------|-------|------|----------|
| Llama 3.3 70B | Q3_K_M | ~35GB | General (quality tradeoff) |
| Qwen 2.5 32B | Q6_K | ~28GB | Quality balance |
| DeepSeek Coder 33B | Q5_K_M | ~25GB | Coding |
| Mistral Large | Q3_K_M | ~38GB | Reasoning |
| Mixtral 8x7B | Q5_K_M | ~35GB | MoE efficiency |
### 128GB RAM/VRAM

| Model | Quant | Size | Use Case |
|-------|-------|------|----------|
| Llama 3.3 70B | Q4_K_M | ~43GB | Best 70B balance |
| Llama 3.3 70B | Q6_K | ~58GB | Higher quality |
| Qwen 2.5 72B | Q5_K_M | ~55GB | Multilingual |
| DeepSeek V3 | Q4_K_M | ~75GB | Latest reasoning |
| Llama 3.1 405B | Q2_K | ~95GB | Maximum capability |
### 192GB+ RAM/VRAM

| Model | Quant | Size | Use Case |
|-------|-------|------|----------|
| Llama 3.1 405B | Q4_K_M | ~180GB | Full quality 405B |
| Llama 3.3 70B | Q8_0 | ~75GB | Highest 70B quality |
| Multiple 70B | Q4_K_M | ~43GB each | Multi-model serving |
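
The sizes above are approximate download sizes; the loaded footprint is larger once the KV cache is allocated. To sanity-check whether a candidate actually fits your tier, one approach is the sketch below (substitute whichever tag you are evaluating):

```bash
# Pull the candidate, load it with a trivial prompt, then compare
# on-disk size (ollama list) against the resident footprint and
# GPU/CPU split (ollama ps).
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M "hello" > /dev/null
ollama list | grep llama3.3
ollama ps
```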
## Use Case Recommendations

### AI-Assisted Coding

**Primary choice:** DeepSeek Coder V2 or Qwen 2.5 Coder

| Requirement | Model | Notes |
|-------------|-------|-------|
| Fast completions | DeepSeek Coder V2 Lite (16B) | Low latency |
| High quality | DeepSeek Coder V2 (236B) | Best code quality |
| Balanced | Qwen 2.5 Coder 32B | Good quality/speed |
| Fill-in-middle | DeepSeek Coder | Native FIM support |

```bash
# Recommended for coding
ollama pull deepseek-coder-v2:16b
ollama pull qwen2.5-coder:32b
```
### General Chat/Assistant

**Primary choice:** Llama 3.3 70B or Qwen 2.5 72B

| Requirement | Model | Notes |
|-------------|-------|-------|
| Best reasoning | Llama 3.3 70B | Latest Llama |
| Multilingual | Qwen 2.5 72B | 29 languages |
| Fast response | Llama 3.1 8B | Sub-second latency |
| Long context | Qwen 2.5 (128K) | Document analysis |
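
For assistant workloads served over Ollama's local HTTP API, a minimal request looks like the sketch below (assumes a default install listening on port 11434 and that the model is already pulled):

```bash
# Single-turn chat request; set "stream": true for token streaming.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b",
  "messages": [
    {"role": "user", "content": "Explain quantization in two sentences."}
  ],
  "stream": false
}'
```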
### Document Processing

**Primary choice:** Qwen 2.5 with long context

| Requirement | Model | Notes |
|-------------|-------|-------|
| Long documents | Qwen 2.5 72B (128K) | Full document context |
| Summarization | Llama 3.3 70B | Strong instruction following |
| Extraction | Mixtral 8x7B | Efficient for structured extraction |
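
Ollama loads models with a modest default context window, so long-document work usually means raising `num_ctx` explicitly. A minimal sketch via a Modelfile (the 32768 value is an assumption; size it to your memory, since the KV cache grows with context length):

```bash
# Create a long-context variant of Qwen 2.5.
cat > Modelfile <<'EOF'
FROM qwen2.5:72b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-longctx -f Modelfile
ollama run qwen2.5-longctx
```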
### RAG/Embeddings

Pair your generation model with an embedding model:

| Embedding Model | Dimensions | Notes |
|-----------------|------------|-------|
| nomic-embed-text | 768 | Good default |
| mxbai-embed-large | 1024 | Higher quality |
| all-MiniLM-L6-v2 | 384 | Fast, small |

```bash
ollama pull nomic-embed-text
```
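
Embeddings are served through the same local API. For example (default port assumed):

```bash
# Returns a JSON object with an "embedding" array
# (768 floats for nomic-embed-text).
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Ollama runs language models locally."
}'
```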
## Quality vs Speed Tradeoffs

### Response Latency

| Model Size | TTFT (time to first token) | Tokens/sec | Use Case |
|------------|----------------------------|------------|----------|
| 7-8B | <100ms | 50-80 | Interactive |
| 32-34B | 100-300ms | 30-50 | Balanced |
| 70B | 200-500ms | 20-40 | Quality focus |
| 405B | 500ms+ | 10-20 | Maximum quality |
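
These figures vary widely with hardware, quantization, and context length, so measure your own setup rather than relying on the table. `ollama run --verbose` prints prompt-eval and generation rates after each response:

```bash
# Benchmark generation speed locally; the trailing stats include
# the eval rate in tokens/s.
ollama run llama3.1:8b --verbose "Summarize what quantization does in two sentences."
```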
### Quality Benchmarks

Approximate rankings (higher = better):

- **General reasoning (MMLU-like):** Llama 3.1 405B > DeepSeek V3 > Llama 3.3 70B > Qwen 2.5 72B > 32B models
- **Coding (HumanEval-like):** DeepSeek Coder V2 > Qwen 2.5 Coder > Llama 3.3 70B > CodeLlama
- **Instruction following:** Llama 3.3 70B > Qwen 2.5 72B > Mistral Large > 32B models
## Model Families

### Llama (Meta)

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| Llama 3.2 | 1B, 3B | 128K | Efficient, mobile |
| Llama 3.3 | 70B | 128K | Latest, best 70B |
| Llama 3.1 | 8B, 70B, 405B | 128K | Full range |
### Qwen (Alibaba)

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| Qwen 2.5 | 0.5B-72B | 128K | Multilingual |
| Qwen 2.5 Coder | 1.5B-32B | 128K | Code specialized |
| Qwen 2.5 Math | 1.5B-72B | - | Math/reasoning |
### DeepSeek

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| DeepSeek V3 | 671B (MoE) | 128K | Latest flagship |
| DeepSeek Coder V2 | 16B, 236B | 128K | Code + reasoning |
| DeepSeek R1 | Various | - | Reasoning focused |
### Mistral

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| Mistral 7B | 7B | 32K | Efficient |
| Mixtral 8x7B | 47B (MoE) | 32K | MoE efficiency |
| Mistral Large 2 | 123B | 128K | Flagship |
## Practical Recommendations

### Single Model Setup

For a 128GB system running one model at a time:

```bash
# Best all-around
ollama pull llama3.3:70b-instruct-q4_K_M

# Best for coding
ollama pull deepseek-coder-v2:236b-q4_K_M  # only if it fits in memory
ollama pull deepseek-coder-v2:16b          # fallback
```
### Multi-Model Setup

Run specialized models for different tasks:

```bash
# Code assistant (keep loaded)
ollama pull qwen2.5-coder:32b-instruct-q5_K_M

# General chat (load on demand)
ollama pull llama3.3:70b-instruct-q4_K_M

# Embeddings (always available)
ollama pull nomic-embed-text
```
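
Which models stay resident is governed by Ollama's keep-alive behavior. A sketch of the relevant knobs (the specific values are assumptions; tune them to your workload):

```bash
# Server-side defaults, set before starting `ollama serve`:
export OLLAMA_KEEP_ALIVE=24h        # keep idle models loaded for 24 hours
export OLLAMA_MAX_LOADED_MODELS=2   # allow two models resident at once

# Per-request override: a generate call with no prompt just loads
# the model; keep_alive of -1 pins it in memory indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "keep_alive": -1
}'
```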
### Development vs Production

| Environment | Priority | Recommendation |
|-------------|----------|----------------|
| Development | Speed | 8-32B models, Q4-Q5 |
| Testing | Balance | 32-70B models, Q4-Q5 |
| Production | Quality | 70B+ models, Q5-Q6 |
## Version Considerations

### Instruct vs Base

- **Instruct/Chat:** fine-tuned for conversation; use for most tasks
- **Base:** the raw pretrained model; use for fine-tuning or specialized prompting

```bash
# Use instruct versions
ollama pull llama3.3:70b-instruct-q4_K_M

# NOT
ollama pull llama3.3:70b-text-q4_K_M  # base model
```
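
If you are unsure which variant a tag points at, inspect it. Instruct/chat builds ship a prompt template, while base builds generally do not:

```bash
# Show parameter count, quantization, and context length
ollama show llama3.3:70b-instruct-q4_K_M

# Print the chat template; little or no output suggests a base model
ollama show llama3.3:70b-instruct-q4_K_M --template
```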
### Latest Versions

Check for updates periodically:

```bash
# Update Ollama models
ollama list               # check what's installed
ollama pull llama3.3:70b  # re-pull to pick up updates
```
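
To refresh everything you have installed in one pass, one option is to re-pull each model from the output of `ollama list` (a sketch; assumes the default tabular output with the model name in the first column):

```bash
# Re-pull every installed model; pulls are incremental, so models
# that are already current finish quickly.
ollama list | awk 'NR > 1 { print $1 }' | xargs -n1 ollama pull
```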
## See Also