RAG Fundamentals¶
Understanding the core concepts behind Retrieval-Augmented Generation.
The RAG Process¶
1. Document Ingestion¶
Transform raw documents into searchable chunks:
Supported formats:
- Text: .txt, .md, .rst
- Documents: .pdf, .docx, .pptx
- Code: .py, .js, .ts, etc.
- Data: .json, .csv, .xml
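As a rough sketch of this step (loader classes are from langchain_community; the file paths are placeholders), ingestion might look like:
from langchain_community.document_loaders import PyPDFLoader, TextLoader
# Load a PDF and a Markdown file into Document objects
pdf_docs = PyPDFLoader("manual.pdf").load()   # requires the pypdf package
text_docs = TextLoader("notes.md").load()
documents = pdf_docs + text_docs  # ready for chunking in the next step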
2. Chunking Strategies¶
How you split documents significantly impacts retrieval quality.
Fixed-Size Chunking¶
Simple but can break context:
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separator="\n"
)
Recursive Chunking (Recommended)¶
Respects document structure:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""]
)
Semantic Chunking¶
Groups by meaning, not size:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
Code-Aware Chunking¶
For codebases:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=2000,
chunk_overlap=200
)
Chunk Size Guidelines¶
| Content Type | Chunk Size | Overlap |
|---|---|---|
| General text | 500-1000 | 100-200 |
| Technical docs | 1000-1500 | 200-300 |
| Code | 1500-2000 | 200-400 |
| Q&A/FAQ | 200-500 | 50-100 |
Embeddings¶
What Are Embeddings?¶
Embeddings are numerical representations of text that capture semantic meaning: texts with similar meanings produce similar vectors, which is what makes semantic search possible.
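One way to see this is to compare embedding vectors directly; the sketch below computes cosine similarity with NumPy (the model and example sentences are only illustrative):
import numpy as np
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embeddings.embed_query("How do I reset my password?")
v2 = embeddings.embed_query("Steps to recover account access")
v3 = embeddings.embed_query("Best pizza toppings")

# Related sentences score noticeably higher than unrelated ones
print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))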
Local Embedding Models¶
# Pull embedding models with Ollama
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull all-minilm
Using Embeddings¶
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Single text
vector = embeddings.embed_query("What is machine learning?")
print(f"Dimensions: {len(vector)}") # 768
# Batch embedding
vectors = embeddings.embed_documents([
"Machine learning is...",
"Deep learning uses...",
"Neural networks are..."
])
Retrieval Methods¶
Similarity Search¶
Find the k most similar documents:
# Basic similarity search
docs = vectorstore.similarity_search(query, k=4)
# With scores (note: some vector stores return a similarity where higher is
# better, others a distance where lower is better -- check your store's docs)
docs_with_scores = vectorstore.similarity_search_with_score(query, k=4)
relevant = [(doc, score) for doc, score in docs_with_scores if score > 0.7]
Maximum Marginal Relevance (MMR)¶
Balance relevance with diversity:
docs = vectorstore.max_marginal_relevance_search(
query,
k=4,
fetch_k=20, # Fetch more, then diversify
lambda_mult=0.5 # 0=max diversity, 1=max relevance
)
Hybrid Search¶
Combine semantic and keyword search:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword-based retriever
bm25 = BM25Retriever.from_documents(documents)
bm25.k = 4
# Semantic retriever
semantic = vectorstore.as_retriever(search_kwargs={"k": 4})
# Combine with weights
hybrid = EnsembleRetriever(
retrievers=[bm25, semantic],
weights=[0.3, 0.7]
)
Context Window Management¶
The Challenge¶
LLMs have limited context windows. You must fit:
- System prompt
- Retrieved documents
- User query
- Space for response
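As a rough back-of-the-envelope sketch (numbers are illustrative and assume ~4 characters per token), you can budget the window before deciding how many chunks to retrieve:
context_window = 8192     # model's total context window, in tokens
system_prompt = 300       # tokens for instructions
question = 100            # tokens for the user query
answer_budget = 1000      # tokens reserved for the response

available_for_context = context_window - system_prompt - question - answer_budget
chunk_tokens = 1000 // 4  # a 1000-character chunk is roughly 250 tokens
max_chunks = available_for_context // chunk_tokens
print(max_chunks)  # ~27, so k=3-5 leaves comfortable headroom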
Strategies¶
1. Limit Retrieved Chunks¶
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 3} # Fewer, more relevant chunks
)
2. Compress Retrieved Content¶
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=retriever
)
3. Rerank Results¶
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Rerank with Cohere's hosted API (requires an API key); for a fully local
# stack, a community compressor such as FlashrankRerank can be swapped in
reranker = CohereRerank(top_n=3)
rerank_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=retriever
)
Prompt Engineering for RAG¶
Basic RAG Prompt¶
template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't know."
Context:
{context}
Question: {question}
Answer:"""
Structured RAG Prompt¶
template = """You are a helpful assistant answering questions based on provided documentation.
INSTRUCTIONS:
1. Only use information from the context below
2. If the context doesn't contain the answer, say so
3. Cite sources when possible
4. Be concise but complete
CONTEXT:
{context}
USER QUESTION: {question}
ANSWER:"""
Evaluation Metrics¶
Retrieval Quality¶
| Metric | Description |
|---|---|
| Precision@k | Relevant docs in top k results |
| Recall@k | Relevant docs retrieved vs total relevant |
| MRR | Mean Reciprocal Rank of first relevant result |
| NDCG | Normalized Discounted Cumulative Gain |
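As a small worked example (document IDs are made up), precision@k, recall@k, and the per-query reciprocal rank can be computed directly from a retrieval run:
retrieved = ["doc3", "doc7", "doc1", "doc9"]   # top k=4 results
relevant = {"doc1", "doc3", "doc5"}            # all relevant docs for this query

hits = [d for d in retrieved if d in relevant]
precision_at_k = len(hits) / len(retrieved)    # 2/4 = 0.5
recall_at_k = len(hits) / len(relevant)        # 2/3 ≈ 0.67

# MRR averages the reciprocal rank of the first relevant result (rank 1 here -> 1.0)
first_hit_rank = next(i for i, d in enumerate(retrieved, start=1) if d in relevant)
reciprocal_rank = 1 / first_hit_rank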
Generation Quality¶
| Metric | Description |
|---|---|
| Faithfulness | Is the answer grounded in context? |
| Answer relevance | Does it answer the question? |
| Context relevance | Was the right context retrieved? |
RAGAS Evaluation¶
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
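RAGAS expects the dataset to carry the question, the generated answer, the retrieved contexts, and (for some metrics) a ground-truth answer; a minimal sketch of building it with the datasets library (values are illustrative, and exact column names can vary between RAGAS versions):
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What is the default chunk size?"],
    "answer": ["The default chunk size is 1000 characters."],
    "contexts": [["The splitter uses chunk_size=1000 with an overlap of 200."]],
    "ground_truth": ["1000 characters"],
})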
Common Pitfalls¶
1. Chunks Too Large¶
- Retrieves irrelevant content
- Wastes context window
- Fix: Smaller chunks with overlap
2. Chunks Too Small¶
- Loses context
- Fragments information
- Fix: Larger chunks or parent document retrieval
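Parent document retrieval indexes small chunks for precise matching but returns the larger chunks they came from; a minimal sketch (assuming the vectorstore and documents from earlier):
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)    # embedded for search
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)  # returned as context

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(documents)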
3. Poor Embedding Model Choice¶
- Misses semantic matches
- Language mismatch
- Fix: Use domain-appropriate embeddings
4. Ignoring Metadata¶
- Can't filter by source, date, etc.
- Fix: Store and use document metadata
from langchain_core.documents import Document
doc = Document(
page_content="...",
metadata={
"source": "manual.pdf",
"page": 42,
"date": "2024-01-15"
}
)
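With metadata in place, retrieval can be filtered to specific sources; the exact filter syntax varies by vector store (the sketch below follows Chroma's style):
# Only search chunks that came from manual.pdf
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "manual.pdf"}}
)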