GGUF Formats

Understanding the GGUF file format used by llama.cpp and Ollama.

Overview

GGUF (GPT-Generated Unified Format) is the standard format for quantized LLMs:

  • Single file - Model, tokenizer, and metadata in one file
  • Quantization - Built-in support for various precision levels
  • Portability - Works across llama.cpp, Ollama, LM Studio
  • Efficiency - Memory-mapped loading, fast startup

File Structure

GGUF File Layout:
┌─────────────────────────────────────────┐
│             Magic Number                │  4 bytes: "GGUF"
├─────────────────────────────────────────┤
│             Version                     │  4 bytes: v3
├─────────────────────────────────────────┤
│             Tensor Count                │  8 bytes: uint64
├─────────────────────────────────────────┤
│         Metadata KV Count               │  8 bytes: uint64
├─────────────────────────────────────────┤
│                                         │
│           Metadata Section              │  Architecture, tokenizer,
│         (Key-Value Pairs)               │  quantization info, etc.
│                                         │
├─────────────────────────────────────────┤
│           Tensor Info Section           │  Names, shapes, offsets
├─────────────────────────────────────────┤
│                                         │
│                                         │
│            Tensor Data                  │  Actual weights
│           (Bulk of file)                │
│                                         │
│                                         │
└─────────────────────────────────────────┘
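
As a quick sanity check, the fixed-size header above can be read with a few lines of Python. A minimal sketch assuming a little-endian GGUF v3 file (the path model.gguf is a placeholder):

import struct

with open("model.gguf", "rb") as f:
    magic = f.read(4)                              # b"GGUF"
    version, = struct.unpack("<I", f.read(4))      # uint32, currently 3
    n_tensors, = struct.unpack("<Q", f.read(8))    # uint64 tensor count
    n_kv, = struct.unpack("<Q", f.read(8))         # uint64 metadata KV count

print(magic, version, n_tensors, n_kv)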

Naming Conventions

Standard Pattern

{model}-{size}-{variant}-{quantization}.gguf

Examples:
llama-3.3-70b-instruct-q4_k_m.gguf
│      │  │   │         └── Quantization type
│      │  │   └── Variant (instruct, chat, base)
│      │  └── Parameter count (70 billion)
│      └── Model version
└── Model family
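
If you need to sort or filter local files by these fields, the pattern is easy to pull apart with a regular expression. A rough sketch that only covers names following the convention above:

import re

NAME_RE = re.compile(
    r"(?P<model>.+)-(?P<size>\d+(?:\.\d+)?b)-(?P<variant>[a-z0-9]+)-(?P<quant>(?:i?q|f)\w+)\.gguf$",
    re.IGNORECASE,
)

m = NAME_RE.match("llama-3.3-70b-instruct-q4_k_m.gguf")
if m:
    print(m.group("model"), m.group("size"), m.group("variant"), m.group("quant"))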

Split Files

Large models split across multiple files:

llama-3.1-405b-instruct-q4_k_m-00001-of-00004.gguf
llama-3.1-405b-instruct-q4_k_m-00002-of-00004.gguf
llama-3.1-405b-instruct-q4_k_m-00003-of-00004.gguf
llama-3.1-405b-instruct-q4_k_m-00004-of-00004.gguf

llama.cpp automatically loads all parts when you specify the first file.
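
Before pointing llama.cpp at shard 00001, it can be worth confirming that every shard is actually on disk. A small sketch based on the -NNNNN-of-NNNNN suffix above (the filename is just the example from this section):

import re
from pathlib import Path

first = Path("llama-3.1-405b-instruct-q4_k_m-00001-of-00004.gguf")
m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", first.name)
if m:
    total = int(m.group(2))
    stem, tail = first.name[:m.start(1)], first.name[m.end(1):]
    missing = [i for i in range(1, total + 1)
               if not first.with_name(f"{stem}{i:05d}{tail}").exists()]
    print("missing shards:", missing or "none")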

Inspecting GGUF Files

Using llama.cpp

# Loading a model prints its metadata (architecture, quantization, context length)
./llama-cli -m model.gguf -p "Hello" -n 1

# Full metadata dump with gguf-dump, installed by the gguf Python package (see below)
gguf-dump model.gguf

Using Python

pip install gguf

from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("model.gguf")

# Metadata lives in reader.fields (key -> ReaderField); decode simple values
def field_value(field):
    if field.types and field.types[0] == GGUFValueType.STRING:
        return str(bytes(field.parts[-1]), encoding="utf-8")
    if len(field.types) == 1:
        return field.parts[-1][0]
    return field.types  # arrays: just report the element types

# Print metadata
for key, field in reader.fields.items():
    print(f"{key}: {field_value(field)}")

# Architecture info
print(field_value(reader.fields["general.architecture"]))
print(field_value(reader.fields["general.name"]))

Key Metadata Fields

Field                   Description     Example
general.architecture    Model type      llama
general.name            Model name      Llama-3.3-70B
general.file_type       Quantization    Q4_K_M
llama.context_length    Max context     131072
llama.embedding_length  Hidden size     8192
llama.block_count       Layer count     80
tokenizer.ggml.model    Tokenizer type  llama
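
Note that the model-specific keys are namespaced by the value of general.architecture, so code reading them should build the key names dynamically. A short sketch reusing the field_value() helper from the GGUFReader example above:

arch = field_value(reader.fields["general.architecture"])      # e.g. "llama"
print(field_value(reader.fields[f"{arch}.context_length"]))    # max context
print(field_value(reader.fields[f"{arch}.embedding_length"]))  # hidden size
print(field_value(reader.fields[f"{arch}.block_count"]))       # layer count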

Converting to GGUF

From Safetensors/PyTorch

Using llama.cpp convert script:

cd llama.cpp

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py \
  /path/to/hf/model \
  --outfile model-f16.gguf \
  --outtype f16

# Then quantize
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Direct Conversion to Q8_0

convert_hf_to_gguf.py can only write a few types directly (f32, f16, bf16, q8_0); K-quants such as Q4_K_M still require the separate llama-quantize step above.

python convert_hf_to_gguf.py \
  /path/to/hf/model \
  --outfile model-q8_0.gguf \
  --outtype q8_0

From Other Formats

# Convert a legacy GGML file to GGUF (script ships with llama.cpp; check --help for current options)
python convert_llama_ggml_to_gguf.py --input model.ggml --output model.gguf

Quantization Process

Basic Quantization

# Available types
./llama-quantize --help

# Quantize to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize to Q5_K_M
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M

With Importance Matrix

For better quality at same size:

# Generate importance matrix from calibration data
./llama-imatrix \
  -m model-f16.gguf \
  -f calibration.txt \
  -o imatrix.dat \
  --chunks 100

# Quantize with imatrix
./llama-quantize \
  model-f16.gguf \
  model-q4_k_m-imat.gguf \
  Q4_K_M \
  --imatrix imatrix.dat

Calibration Data

Use representative text for your use case:

# For code models, use code samples
# For chat models, use conversation samples

# Example: Download wiki text
wget https://huggingface.co/datasets/wikitext/resolve/main/wikitext-2-v1/wiki.test.raw
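
If you would rather calibrate on your own data, the imatrix tool only needs a plain-text file, so assembling one is straightforward. A sketch using a hypothetical my_samples/ directory of text files:

from pathlib import Path

# Concatenate sample files into a single calibration.txt for llama-imatrix
with open("calibration.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("my_samples").glob("*.txt")):
        out.write(path.read_text(encoding="utf-8"))
        out.write("\n\n")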

Quantization Types Reference

Standard Types

Type     Bits/Weight  Best For
F32      32           Reference only
F16      16           Maximum quality
Q8_0     8            High quality
Q6_K     6.5          Very good quality
Q5_K_M   5.5          Good quality
Q5_K_S   5.0          Smaller Q5
Q4_K_M   4.5          Recommended
Q4_K_S   4.25         Smaller Q4
Q4_0     4.0          Legacy
Q3_K_M   3.5          Memory constrained
Q3_K_S   3.0          Smaller Q3
Q2_K     2.5          Extreme compression
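
The bits-per-weight column gives a quick pre-quantization size estimate: roughly parameters * bits / 8, plus metadata and a few tensors that stay at higher precision. A back-of-the-envelope sketch:

def approx_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # parameters * bits / 8 bytes; ignores metadata and mixed-precision tensors
    return params_billions * bits_per_weight / 8

print(approx_gguf_size_gb(70, 4.5))   # Q4_K_M on a 70B model -> about 39 GB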

I-Quant Types

Type     Description
IQ4_NL   4-bit non-linear
IQ4_XS   4-bit extra small
IQ3_XS   3-bit extra small
IQ3_XXS  3-bit extra extra small
IQ2_XXS  2-bit extra extra small

The low-bit I-quants benefit strongly from an importance matrix (see above) and trade some extra compute at inference time for better quality per bit.

Verifying GGUF Files

Check Integrity

# Read-only metadata check (gguf-dump comes with the gguf Python package)
gguf-dump model.gguf

# Quick test inference
./llama-cli -m model.gguf -p "Hello" -n 10

Compare Sizes

# Expected sizes for 70B model
ls -lh *.gguf

# F16:   ~140 GB
# Q8_0:  ~74 GB
# Q6_K:  ~57 GB
# Q5_K_M: ~48 GB
# Q4_K_M: ~43 GB
# Q3_K_M: ~35 GB
# Q2_K:  ~28 GB

Importing to Ollama

Create Modelfile

# Modelfile
FROM /tank/ai/models/gguf/llama-3.3-70b-q4_k_m.gguf

# Set chat template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Import Model

# Create Ollama model from GGUF
ollama create my-llama -f Modelfile

# Verify
ollama list
ollama run my-llama
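
Once ollama run works, the model is also reachable through Ollama's local REST API, which is handy for scripted smoke tests. A minimal sketch assuming the default endpoint on localhost:11434 and the my-llama model created above:

import json
import urllib.request

payload = {"model": "my-llama", "prompt": "Hello", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])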

Storage Recommendations

ZFS Dataset

# Create dataset optimized for large files (quantized weights are high-entropy, so compression is left off)
zfs create -o recordsize=1M -o compression=off tank/ai/models/gguf

Organization

/tank/ai/models/gguf/
├── llama/
│   ├── llama-3.3-70b-instruct-q4_k_m.gguf
│   └── llama-3.2-8b-instruct-q8_0.gguf
├── qwen/
│   ├── qwen2.5-72b-instruct-q4_k_m.gguf
│   └── qwen2.5-coder-32b-q5_k_m.gguf
└── deepseek/
    └── deepseek-coder-v2-16b-q8_0.gguf

See Also