
RAG (Retrieval-Augmented Generation)

RAG combines large language models with external knowledge retrieval, grounding responses in your own documents and data instead of relying only on what the model memorized during training.

Why RAG?

LLMs have knowledge cutoffs and can hallucinate. RAG addresses this by:

  • Grounding responses in actual documents
  • Reducing hallucinations through factual retrieval
  • Keeping knowledge current without retraining
  • Adding domain-specific expertise from your own data

Core Components

┌──────────────────────────────────────────────────────────┐
│                       RAG Pipeline                        │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌───────────┐   ┌──────────┐   ┌────────────┐           │
│  │ Documents │──>│ Chunking │──>│ Embeddings │           │
│  └───────────┘   └──────────┘   └────────────┘           │
│                                        │                 │
│                                        v                 │
│                               ┌─────────────────┐        │
│                               │ Vector Database │        │
│                               └─────────────────┘        │
│                                        │                 │
│  ┌─────────┐   ┌───────────┐           │                 │
│  │  Query  │──>│ Retrieval │<──────────┘                 │
│  └─────────┘   └───────────┘                             │
│                      │                                   │
│                      v                                   │
│               ┌─────────────┐   ┌──────────┐             │
│               │  Augmented  │──>│   LLM    │             │
│               │   Context   │   │ Response │             │
│               └─────────────┘   └──────────┘             │
│                                                          │
└──────────────────────────────────────────────────────────┘
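
To make each stage of the diagram concrete, here is the same flow written out by hand. This is an illustrative sketch only, assuming Ollama is running locally with the nomic-embed-text and llama3.2 models pulled; the file name notes.txt, the chunk sizes, and the helper functions are placeholders, and the "vector database" is just an in-memory list searched by cosine similarity.

import math

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama


def chunk(text, size=500, overlap=100):
    # "Chunking": fixed-size character windows with overlap
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def cosine(a, b):
    # Similarity measure used for retrieval
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


embedder = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.2")

# "Documents" -> "Chunking" -> "Embeddings" -> in-memory "Vector Database"
text = open("notes.txt").read()  # placeholder document
chunks = chunk(text)
index = list(zip(chunks, embedder.embed_documents(chunks)))

# "Query" -> "Retrieval": rank stored chunks against the query embedding
query = "What does the document say about deployment?"
q_vec = embedder.embed_query(query)
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]

# "Augmented Context" -> "LLM Response": stuff the retrieved chunks into the prompt
context = "\n\n".join(chunk_text for chunk_text, _ in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(llm.invoke(prompt))

Production frameworks replace the in-memory list with a real vector database and add persistence and metadata filtering, but the data flow matches the diagram above.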

In This Section

Document          Description
Fundamentals      Core concepts, chunking strategies, retrieval methods
Implementation    Building a RAG pipeline from scratch
Optimization      Improving retrieval quality and performance
Integration       Connecting RAG to your applications

Quick Start

Minimal RAG with LangChain

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Create embeddings and store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# Create retrieval chain
llm = Ollama(model="llama3.2")
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# Query
response = qa.invoke("What is the main topic of these documents?")
print(response["result"])
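
The retriever returned by as_retriever() defaults to plain similarity search over a handful of chunks (four in recent LangChain releases); pass search_kwargs={"k": ...} to widen or narrow the context handed to the LLM.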

Technology Stack

Embedding Models (Local)

Model              Dimensions  Use Case
nomic-embed-text   768         General purpose, fast
mxbai-embed-large  1024        Higher quality, slower
all-minilm         384         Lightweight, mobile
bge-m3             1024        Multilingual
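
Switching models is normally just a different model name passed to the embeddings class, but vectors from different models are not interchangeable, so the index has to be rebuilt. A quick, rough way to confirm the dimensions above, assuming the models have already been pulled into Ollama:

from langchain_community.embeddings import OllamaEmbeddings

for model in ["nomic-embed-text", "mxbai-embed-large", "all-minilm"]:
    vector = OllamaEmbeddings(model=model).embed_query("dimension check")
    print(model, len(vector))  # expected: 768, 1024, 384 per the table above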

Vector Databases

Database  Best For
ChromaDB  Local development, simplicity
Qdrant    Production, scalability
Milvus    Large scale, enterprise
pgvector  PostgreSQL integration
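
Because the Quick Start goes through LangChain's common vector-store interface, swapping databases is mostly a construction change. A sketch for Qdrant, assuming the qdrant-client package is installed and reusing the chunks and embeddings objects from the Quick Start; ":memory:" keeps the index in-process, while a URL would point at a running server:

from langchain_community.vectorstores import Qdrant

# Same chunks and embeddings as the Quick Start; only the store changes.
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",  # or url="http://localhost:6333" for a server
    collection_name="docs",
)
retriever = vectorstore.as_retriever()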

LLM Integration

Any local inference engine that can generate text from a prompt works with RAG; the retrieval and augmentation steps are independent of how the model is served.
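
As one illustration, many local engines (Ollama, llama.cpp's server, vLLM, LM Studio) expose an OpenAI-compatible endpoint, so the retrieved context can be injected into an ordinary chat completion. The sketch below assumes Ollama's OpenAI-compatible API on its default port and a vectorstore built as in the Quick Start; the base URL, port, and model name are placeholders for whichever engine is actually running.

from openai import OpenAI

# Assumption: an OpenAI-compatible server at Ollama's default address.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

question = "What is the main topic of these documents?"
retriever = vectorstore.as_retriever()  # vectorstore from the Quick Start
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

reply = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(reply.choices[0].message.content)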

See Also