# Model Management

Understanding model selection, formats, and acquisition for local LLM inference.
## Model Ecosystem Overview
```
┌────────────────────────────────────────────────────────────┐
│                       Model Sources                        │
├────────────────────────────────────────────────────────────┤
│  Hugging Face    │  Ollama Library   │  Direct Download    │
│  (huggingface.co)│  (ollama.com)     │  (vendors)          │
└────────┬─────────┴─────────┬─────────┴──────────┬──────────┘
         │                   │                    │
         ▼                   ▼                    ▼
┌────────────────────────────────────────────────────────────┐
│                       Model Formats                        │
├────────────────────────────────────────────────────────────┤
│  GGUF            │  Safetensors      │  PyTorch (.bin)     │
│  (llama.cpp,     │  (vLLM, MLX,      │  (legacy, needs     │
│   Ollama)        │   transformers)   │   conversion)       │
└────────┬─────────┴─────────┬─────────┴──────────┬──────────┘
         │                   │                    │
         ▼                   ▼                    ▼
┌────────────────────────────────────────────────────────────┐
│                     Inference Engines                      │
├────────────────────────────────────────────────────────────┤
│  llama.cpp       │  Ollama           │  vLLM               │
│  (GGUF native)   │  (GGUF native)    │  (Safetensors)      │
└────────────────────────────────────────────────────────────┘
```
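For example, a single quantized file can be pulled from Hugging Face with the `huggingface_hub` package. A minimal sketch, assuming the package is installed; the repo and filename shown are illustrative, so substitute the model and quantization you actually want:

```python
from huggingface_hub import hf_hub_download

# Download one GGUF file into the model store (see Storage Layout below).
path = hf_hub_download(
    repo_id="bartowski/Llama-3.3-70B-Instruct-GGUF",  # illustrative repo
    filename="Llama-3.3-70B-Instruct-Q4_K_M.gguf",    # pick your quantization
    local_dir="/tank/ai/models/gguf",                 # matches the layout below
)
print(path)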
## Quick Reference
### Models for 128GB System
| Model | Parameters | Quantization | Memory | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | Q8_0 | ~10GB | Fast responses |
| Qwen 2.5 32B | 32B | Q5_K_M | ~25GB | Balanced |
| DeepSeek Coder V2 | 16B (Lite) / 236B | Q4_K_M | ~12GB / ~140GB | Coding |
| Llama 3.3 70B | 70B | Q4_K_M | ~43GB | General purpose |
| Qwen 2.5 72B | 72B | Q4_K_M | ~45GB | Multilingual |
| Llama 3.1 405B | 405B | Q2_K | ~95GB | Maximum capability |

Note that the full 236B DeepSeek Coder V2 (~140GB at Q4_K_M) does not fit in 128GB; only the 16B Lite variant does.
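The memory figures above follow a simple rule of thumb: weights take roughly parameters × bits-per-weight / 8 bytes, plus KV cache and runtime overhead. A rough sketch, using approximate average bits-per-weight for llama.cpp K-quants (not exact values):

```python
# Approximate average bits per weight for common llama.cpp quantizations.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5}

def estimate_gb(params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Approximate resident size in GB for a quantized model plus overhead."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # Gparams * bits / 8
    return weights_gb + overhead_gb  # KV cache grows further with context length

print(f"{estimate_gb(70, 'Q4_K_M'):.0f} GB")  # ~44 GB, in line with the ~43GB row
```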
### Format Selection
| Format | Use With | Pros | Cons |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Quantized, small | Limited to llama.cpp ecosystem |
| Safetensors | vLLM, transformers | Safe, fast loading | Larger files |
| AWQ | vLLM | Fast inference | NVIDIA only |
| GPTQ | vLLM, transformers | Wide support | Slightly slower |
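When a directory holds a mix of downloads, the format can usually be identified from file extensions alone (AWQ and GPTQ checkpoints typically ship as `.safetensors`, with the quantization noted in the model config). A minimal sketch:

```python
from pathlib import Path

# Extension -> format, matching the table above.
EXTENSIONS = {".gguf": "GGUF", ".safetensors": "Safetensors", ".bin": "PyTorch"}

def detect_formats(model_dir: str) -> set[str]:
    """Report which model formats are present under a directory."""
    return {EXTENSIONS[p.suffix] for p in Path(model_dir).rglob("*")
            if p.suffix in EXTENSIONS}

print(detect_formats("/tank/ai/models/gguf"))  # e.g. {'GGUF'}
```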
## Model Categories
### General Purpose
| Model | Strengths | Size Range |
|---|---|---|
| Llama 3.3 | Reasoning, instruction following | 70B |
| Qwen 2.5 | Multilingual, long context | 0.5B-72B |
| Mistral Large | European languages, reasoning | 123B |
| Gemma 2 | Efficient, well-tuned | 2B-27B |
### Code-Specialized
| Model | Languages | Notes |
|---|---|---|
| DeepSeek Coder V2 | Python, TypeScript, Go, Rust | Fill-in-middle support (see the sketch after this table) |
| Qwen 2.5 Coder | Python, JavaScript | Strong completion |
| CodeLlama | Python, C++, Java | Based on Llama 2 |
| StarCoder2 | 80+ languages | Open training data |
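Fill-in-middle (FIM) models complete the gap between a prefix and a suffix rather than only continuing left-to-right, which is what editor plugins use for inline completion. A hedged sketch against llama.cpp's `/infill` server endpoint, assuming a FIM-capable GGUF is already loaded in `llama-server` on port 8080; field names follow that endpoint's request schema:

```python
import json
import urllib.request

# Ask the model to fill in the function body between prefix and suffix.
payload = {
    "input_prefix": "def fibonacci(n):\n    ",
    "input_suffix": "\n\nprint(fibonacci(10))\n",
    "n_predict": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/infill",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])  # the generated middle
```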
### Domain-Specific
| Model | Domain | Notes |
|---|---|---|
| Meditron | Medical | Llama-based |
| Lawyer LLM | Legal | Document analysis |
| FinGPT | Finance | Market analysis |
## Storage Layout
Recommended ZFS dataset structure:
```
tank/ai/models/
├── gguf/                  # GGUF format models
│   ├── llama-3.3-70b-q4_k_m.gguf
│   └── deepseek-coder-v2-16b-q5_k_m.gguf
├── huggingface/           # HF cache directory
│   └── hub/
│       └── models--meta-llama--Llama-3.3-70B-Instruct/
└── ollama/                # Ollama model storage
    └── models/
        └── blobs/
```
See Model Volumes for ZFS configuration.
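To keep downloads on these datasets, point each tool's cache there before pulling anything. A sketch, assuming the mountpoints match the layout above; Ollama reads `OLLAMA_MODELS` from its own service environment (e.g. its systemd unit), so only the Hugging Face side is shown in Python:

```python
import os

# HF_HOME must be set before importing huggingface_hub, which reads it at
# import time; the cache then lands under /tank/ai/models/huggingface/hub/.
os.environ["HF_HOME"] = "/tank/ai/models/huggingface"

from huggingface_hub import snapshot_download

snapshot_download("meta-llama/Llama-3.3-70B-Instruct")  # gated repo: needs an HF token
```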
## Topics

- **Choosing Models**: model selection criteria for different use cases
- **Quantization**: understanding Q4, Q5, Q6, Q8 and their tradeoffs
- **Hugging Face**: downloading models with huggingface-cli
- **GGUF Formats**: the GGUF file format and conversion
## See Also
- Inference Engines - Which engine for which format
- Memory Management - Fitting models in memory
- Model Volumes - Storage configuration