Fine-Tuning LLMs¶
Fine-tuning adapts pre-trained models to specific tasks or domains, improving performance on your use case.
When to Fine-Tune¶
Fine-tuning is useful when:
- Domain expertise needed - Legal, medical, technical jargon
- Consistent style required - Specific tone, format, or structure
- Task-specific behavior - Classification, extraction, structured output
- RAG isn't enough - Model needs to internalize knowledge, not just retrieve it
Fine-Tuning vs Alternatives¶
| Approach | Best For | Effort | Cost |
|---|---|---|---|
| Prompt engineering | Quick iterations | Low | Low |
| RAG | Dynamic knowledge | Medium | Medium |
| Fine-tuning | Behavior change | High | High |
| Continued pre-training | Large domain shift | Very high | Very high |
Methods¶
Full Fine-Tuning¶
Updates all model weights: - Requires significant GPU memory (model size + gradients + optimizer states) - Best results but most resource-intensive - Risk of catastrophic forgetting
LoRA (Low-Rank Adaptation)¶
Updates small adapter layers: - Much lower memory requirements - Faster training - Can swap adapters at inference time - Most practical for local fine-tuning
QLoRA¶
Quantized LoRA: - Combines 4-bit quantization with LoRA - Enables fine-tuning on consumer GPUs - Minimal quality loss
In This Section¶
| Document | Description |
|---|---|
| LoRA | Low-Rank Adaptation fine-tuning |
| Training | Preparing data and running training |
| Inference | Using fine-tuned models |
Hardware Requirements¶
Minimum (QLoRA, 7B model)¶
- GPU: 8GB VRAM (RTX 3070 or better)
- RAM: 16GB
- Storage: 50GB
Recommended (LoRA, 7B model)¶
- GPU: 24GB VRAM (RTX 3090/4090)
- RAM: 32GB
- Storage: 100GB
Full Fine-Tuning (7B model)¶
- GPU: 80GB+ VRAM (A100, H100)
- RAM: 64GB+
- Storage: 200GB+
Quick Start¶
Unsloth (Fastest)¶
from unsloth import FastLanguageModel
# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3-8b-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
)
# Train (see training.md for details)
Using with Ollama¶
After fine-tuning, convert and use with Ollama:
# Create Modelfile
cat > Modelfile <<EOF
FROM ./model-merged
TEMPLATE "{{ .Prompt }}"
EOF
# Create Ollama model
ollama create my-finetuned -f Modelfile
# Run
ollama run my-finetuned
Common Use Cases¶
Instruction Following¶
Train the model to follow specific instruction formats:
Code Generation¶
Specialize for a programming language or framework:
Classification¶
Train for consistent categorization:
Structured Output¶
Train to produce specific JSON structures:
See Also¶
- LoRA Guide
- Training Guide
- Models Overview
- RAG - Alternative to fine-tuning