# RAG (Retrieval-Augmented Generation)
RAG combines the power of large language models with external knowledge retrieval, enabling accurate responses grounded in your own documents and data.
## Why RAG?
LLMs have knowledge cutoffs and can hallucinate. RAG mitigates both problems by:
- Grounding responses in actual documents
- Reducing hallucinations through factual retrieval
- Keeping knowledge current without retraining
- Adding domain-specific expertise from your own data
## Core Components
```
                  RAG Pipeline

┌───────────┐     ┌──────────┐     ┌────────────┐
│ Documents │────>│ Chunking │────>│ Embeddings │
└───────────┘     └──────────┘     └────────────┘
                                         │
                                         v
                                ┌─────────────────┐
                                │ Vector Database │
                                └─────────────────┘
                                         │
┌───────────┐     ┌───────────┐          │
│   Query   │────>│ Retrieval │<─────────┘
└───────────┘     └───────────┘
                        │
                        v
              ┌─────────────┐     ┌──────────┐
              │  Augmented  │────>│   LLM    │
              │   Context   │     │ Response │
              └─────────────┘     └──────────┘
```
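The final two stages of the diagram (augmented context and LLM response) come down to prompt assembly. Below is a minimal, framework-free sketch of that step, assuming the relevant chunks have already been retrieved and a local Ollama server is running; the `answer` helper and the prompt wording are illustrative, not prescribed.

```python
import ollama

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Stuff the retrieved chunks into the prompt as grounding context
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]
```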
## In This Section
| Document | Description |
|---|---|
| Fundamentals | Core concepts, chunking strategies, retrieval methods |
| Implementation | Building a RAG pipeline from scratch |
| Optimization | Improving retrieval quality and performance |
| Integration | Connecting RAG to your applications |
## Quick Start

### Minimal RAG with LangChain
```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
# Create embeddings and store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")
# Create retrieval chain
llm = Ollama(model="llama3.2")
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
# Query
response = qa.invoke("What is the main topic of these documents?")
print(response["result"])
```
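If the persisted store in `./db` already exists from a previous run, a sketch like the following (reusing the same `embeddings` and `llm` objects as above) reopens it instead of re-indexing:

```python
# Reuse the persisted index rather than rebuilding it on every run
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
```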
## Technology Stack

### Embedding Models (Local)
| Model | Dimensions | Use Case |
|---|---|---|
| nomic-embed-text | 768 | General purpose, fast |
| mxbai-embed-large | 1024 | Higher quality, slower |
| all-minilm | 384 | Lightweight, mobile |
| bge-m3 | 1024 | Multilingual |
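The dimension determines index size and must match whatever the vector store was built with, so it is worth verifying before indexing. A quick check, assuming the models have already been pulled into Ollama:

```python
from langchain_community.embeddings import OllamaEmbeddings

# Embed a test string and report each model's vector size
for model in ["nomic-embed-text", "mxbai-embed-large", "all-minilm"]:
    vector = OllamaEmbeddings(model=model).embed_query("dimension check")
    print(f"{model}: {len(vector)} dimensions")
```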
### Vector Databases
| Database | Best For |
|---|---|
| ChromaDB | Local development, simplicity |
| Qdrant | Production, scalability |
| Milvus | Large scale, enterprise |
| pgvector | PostgreSQL integration |
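With LangChain the vector store is largely a drop-in choice. As a sketch, the Quick Start above could target Qdrant instead of ChromaDB; the in-memory location and collection name below are illustrative:

```python
from langchain_community.vectorstores import Qdrant

# ":memory:" is convenient for experiments; point `url` at a running
# Qdrant server when you need persistence and scale.
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",
    collection_name="rag_docs",
)
retriever = vectorstore.as_retriever()
```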
### LLM Integration
All local inference engines work with RAG: