Concept 01
Why RAG? The 3 Problems It Solves
Without RAG, using LLMs in your application means accepting three serious problems:
Problem 1: Hallucination. LLMs confidently generate plausible-sounding text that's factually wrong. Ask GPT-4 about your company's refund policy and it will make something up that sounds reasonable but is completely wrong. The model has no idea what your policy actually says — it just pattern-matches to what refund policies typically say.
Problem 2: Private data blindness. LLMs are trained on public internet data. They know nothing about your internal documentation, customer conversations, product manuals, or proprietary database contents. Fine-tuning helps, but it is expensive, slow, and still can't keep the model current with today's data.
Problem 3: Context window limits. Even with a 200k token context window, you can't stuff your entire company knowledge base into every API call. It's too expensive, too slow, and the model degrades in quality when the context is full of irrelevant information.
RAG solves all three: it retrieves only the relevant documents for each query and injects them as context. The LLM is now grounded in real, current, private data — and the context window is filled with relevant information, not noise.
Concept 02
RAG Architecture — Two Phases
A RAG system runs in two phases: an offline indexing phase (load documents, chunk them, embed the chunks, store them in a vector database) and a real-time query phase (embed the query, retrieve the most similar chunks, augment the prompt with them, and generate the answer). The following concepts walk through both phases in code.
Concept 03
The Indexing Pipeline — Load, Chunk, Embed, Store
```python
import chromadb
from pathlib import Path
from openai import OpenAI

client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./rag_db")


# --- STEP 1: LOAD ---
def load_text_file(file_path: str) -> str:
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


def load_documents_from_directory(directory: str) -> list[dict]:
    """Load all .txt and .md files from a directory."""
    docs = []
    for pattern in ("*.txt", "*.md"):
        for file_path in Path(directory).rglob(pattern):
            docs.append({
                "text": load_text_file(str(file_path)),
                "source": str(file_path),
                "filename": file_path.name,
            })
    return docs


# --- STEP 2: CHUNK ---
def chunk_document(
    text: str,
    chunk_size: int = 500,    # words (roughly 650 tokens)
    chunk_overlap: int = 50,  # overlap words for context continuity
) -> list[str]:
    """
    Split a document into overlapping chunks by word count.
    For production, use sentence-aware chunking to avoid mid-sentence splits.
    """
    words = text.split()
    chunks = []
    step = max(1, chunk_size - chunk_overlap)
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks


# --- STEP 3 & 4: EMBED AND STORE ---
def get_embedding_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[t.replace("\n", " ") for t in texts],
    )
    return [item.embedding for item in response.data]


def build_index(documents_dir: str, collection_name: str = "knowledge_base"):
    """Complete indexing pipeline: Load → Chunk → Embed → Store."""
    # Load documents
    raw_docs = load_documents_from_directory(documents_dir)
    print(f"Loaded {len(raw_docs)} documents")

    # Recreate the collection from scratch
    try:
        chroma_client.delete_collection(collection_name)
    except Exception:
        pass
    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )

    # Process each document
    total_chunks = 0
    for doc in raw_docs:
        chunks = chunk_document(doc["text"])
        print(f"  {doc['filename']}: {len(chunks)} chunks")

        # Embed and store in batches of 50
        batch_size = 50
        for i in range(0, len(chunks), batch_size):
            batch_chunks = chunks[i:i + batch_size]
            batch_embeddings = get_embedding_batch(batch_chunks)
            # Unique IDs per chunk (note: filenames must be unique across subdirectories)
            ids = [f"{doc['filename']}_chunk_{i + j}" for j in range(len(batch_chunks))]
            metadatas = [{"source": doc["source"], "filename": doc["filename"]}
                         for _ in batch_chunks]
            collection.add(
                documents=batch_chunks,
                embeddings=batch_embeddings,
                metadatas=metadatas,
                ids=ids,
            )
            total_chunks += len(batch_chunks)

    print(f"\nIndex built: {total_chunks} total chunks stored")
    return collection
```
Concept 04
Chunking Strategies — The Choice That Matters Most
| Strategy | How it works | Pros | Cons | Best for |
|---|---|---|---|---|
| Fixed size | Split every N tokens/words | Simple, predictable | Splits mid-sentence | Prototypes, homogeneous text |
| Sentence-aware | Split at sentence boundaries | No broken sentences | Variable chunk sizes | Articles, documentation |
| Semantic | Group semantically related sentences | Best retrieval quality | Expensive (embeds during chunking) | High-quality production RAG |
| Hierarchical | Parent doc + child chunks | Retrieves detail but returns parent | Complex indexing | Long technical documents |
```python
import re


def chunk_by_sentences(
    text: str,
    sentences_per_chunk: int = 5,
    overlap_sentences: int = 1,
) -> list[str]:
    """
    Sentence-aware chunking.
    Never splits mid-sentence; maintains context with overlap.
    """
    # Split into sentences at terminal punctuation
    sentence_endings = r'(?<=[.!?])\s+'
    sentences = re.split(sentence_endings, text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    step = max(1, sentences_per_chunk - overlap_sentences)  # guard: overlap must not swallow the step
    for i in range(0, len(sentences), step):
        chunk_sentences = sentences[i:i + sentences_per_chunk]
        if chunk_sentences:
            chunks.append(" ".join(chunk_sentences))
    return chunks
```
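The hierarchical strategy from the table can be sketched as parent-child chunking: embed small child chunks for precise matching, but map each child back to a larger parent passage that is what the LLM actually receives. This is a minimal illustration under assumed parameters (the sentence counts and ID scheme are illustrative, not a fixed API):

```python
import re


def build_parent_child_chunks(
    text: str,
    parent_sentences: int = 10,
    child_sentences: int = 3,
) -> tuple[list[dict], dict[str, str]]:
    """
    Hierarchical chunking sketch: small children are embedded for retrieval;
    each child ID maps back to the larger parent passage returned to the LLM.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

    children = []  # small chunks to embed
    parents = {}   # child_id -> parent passage text
    for p_start in range(0, len(sentences), parent_sentences):
        p_end = min(p_start + parent_sentences, len(sentences))
        parent_text = " ".join(sentences[p_start:p_end])
        for c_start in range(p_start, p_end, child_sentences):
            c_end = min(c_start + child_sentences, p_end)
            child_id = f"child_{p_start}_{c_start}"
            children.append({"id": child_id, "text": " ".join(sentences[c_start:c_end])})
            parents[child_id] = parent_text
    return children, parents
```

At query time you would search over the embedded `children`, then look up `parents[child_id]` and inject the parent passage into the prompt, giving the model more surrounding context than the matched snippet alone.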
Poor chunking is the most common reason RAG systems give bad answers. Too small: chunks lack context to answer anything. Too large: irrelevant text dilutes the relevant part. Ideal for most use cases: 300–500 tokens with 10–15% overlap. If your retrieved chunks don't actually contain the answer, your LLM can't generate it.
Concept 05
The Query Pipeline — Embed, Retrieve, Augment, Generate
```python
def retrieve_context(
    query: str,
    collection: chromadb.Collection,
    top_k: int = 5,
    min_similarity: float = 0.7,
) -> list[dict]:
    """
    Retrieve relevant chunks for a query.
    Filters out low-similarity results.
    """
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query.replace("\n", " "),
    ).data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Convert distances to similarities (ChromaDB cosine distance = 1 - cosine similarity)
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        similarity = 1 - dist
        if similarity >= min_similarity:
            chunks.append({
                "text": doc,
                "source": meta.get("filename", "unknown"),
                "similarity": similarity,
            })
    return chunks


def build_augmented_prompt(query: str, retrieved_chunks: list[dict]) -> str:
    """
    Inject retrieved context into the prompt.
    Clear source attribution reduces hallucination.
    """
    if not retrieved_chunks:
        return query

    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    context = "\n\n".join(context_parts)

    return f"""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.

CONTEXT:
{context}

QUESTION: {query}"""
```
Concept 06
The Complete Working RAG Pipeline
```python
class RAGSystem:
    """
    Complete production-ready RAG system.
    Supports indexing, querying, and source citation.
    """

    SYSTEM_PROMPT = """You are a helpful assistant that answers questions based strictly
on provided context documents. Follow these rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so explicitly
3. Always cite which source document your answer comes from
4. Never fabricate information or draw on general knowledge"""

    def __init__(self, collection_name: str = "knowledge_base"):
        self.collection_name = collection_name
        self._collection = None

    @property
    def collection(self):
        if self._collection is None:
            self._collection = chroma_client.get_collection(self.collection_name)
        return self._collection

    def index(self, documents_dir: str):
        """Index all documents in a directory."""
        self._collection = build_index(documents_dir, self.collection_name)

    def query(
        self,
        question: str,
        top_k: int = 5,
        min_similarity: float = 0.65,
        model: str = "gpt-4o-mini",
    ) -> dict:
        """
        Full RAG query: retrieve context → augment prompt → generate answer.
        Returns answer, sources, and retrieved chunks for debugging.
        """
        # Step 1: Retrieve relevant context
        retrieved_chunks = retrieve_context(
            question,
            self.collection,
            top_k=top_k,
            min_similarity=min_similarity,
        )
        if not retrieved_chunks:
            return {
                "answer": "I couldn't find relevant information in the knowledge base to answer this question.",
                "sources": [],
                "chunks_retrieved": 0,
            }

        # Step 2: Build augmented prompt
        augmented_query = build_augmented_prompt(question, retrieved_chunks)

        # Step 3: Generate answer
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": augmented_query},
            ],
            temperature=0,  # Deterministic for factual Q&A
        )
        answer = response.choices[0].message.content

        return {
            "answer": answer,
            "sources": list(set(c["source"] for c in retrieved_chunks)),
            "chunks_retrieved": len(retrieved_chunks),
            "top_similarity": max(c["similarity"] for c in retrieved_chunks),
            "retrieved_chunks": retrieved_chunks,  # For debugging
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
        }

    def chat(self, conversation_history: list[dict], new_question: str) -> dict:
        """
        RAG with conversation history.
        Uses recent turns for better query understanding.
        """
        # Build a context-aware query from the conversation history
        if conversation_history:
            history_text = "\n".join(
                f"{m['role'].upper()}: {m['content']}"
                for m in conversation_history[-4:]  # Last 2 turns (user + assistant pairs)
            )
            contextualized_query = f"""Given this conversation:
{history_text}

Answer this follow-up question: {new_question}"""
        else:
            contextualized_query = new_question
        return self.query(contextualized_query)


# Complete usage example
rag = RAGSystem()

# Index your documents (do once)
rag.index("./company_docs")

# Query
result = rag.query("What is the company's return policy for software products?")
print("Answer:", result["answer"])
print("Sources:", result["sources"])
print(f"Retrieved {result['chunks_retrieved']} chunks")
print(f"Top similarity: {result['top_similarity']:.3f}")
```
Concept 07
Evaluating RAG — How to Know If It's Actually Working
RAG systems need systematic evaluation. Vibe-checking is not enough. Here are the key metrics and how to measure them:
| Metric | What it measures | Target | How to measure |
|---|---|---|---|
| Context Precision | Of retrieved chunks, what % were actually relevant? | > 0.8 | LLM-as-judge: "Is this chunk relevant to the question?" |
| Context Recall | Did retrieval find all relevant chunks? | > 0.7 | Requires ground-truth annotations |
| Answer Faithfulness | Is the answer supported by retrieved context? | > 0.9 | LLM-as-judge: "Is every claim in the answer supported by context?" |
| Answer Relevance | Does the answer actually address the question? | > 0.8 | LLM-as-judge or human eval |
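As a concrete illustration of the arithmetic behind the first two rows, here is how precision and recall are computed once you have per-chunk relevance judgments (from an LLM judge or human annotations); the input formats here are assumptions for the example, not part of any framework:

```python
def context_precision(judgments: list[bool]) -> float:
    """Of the retrieved chunks, what fraction did the judge mark relevant?"""
    return sum(judgments) / len(judgments) if judgments else 0.0


def context_recall(retrieved_ids: set[str], ground_truth_ids: set[str]) -> float:
    """What fraction of the ground-truth relevant chunks were actually retrieved?"""
    if not ground_truth_ids:
        return 1.0
    return len(retrieved_ids & ground_truth_ids) / len(ground_truth_ids)
```

So if 4 of 5 retrieved chunks are judged relevant, precision is 0.8; if retrieval found 2 of the 3 chunks your annotations say are needed, recall is about 0.67.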
```python
import json


def evaluate_rag_faithfulness(
    question: str,
    answer: str,
    retrieved_context: list[str],
) -> dict:
    """
    LLM-as-judge for answer faithfulness.
    Checks whether the answer makes claims not supported by the context.
    """
    context_text = "\n\n".join(retrieved_context)
    eval_prompt = f"""You are evaluating a RAG system's answer for faithfulness.

CONTEXT DOCUMENTS:
{context_text}

QUESTION: {question}
ANSWER: {answer}

Evaluate: Does every factual claim in the ANSWER appear in the CONTEXT DOCUMENTS?
Return JSON:
{{
  "faithful": true/false,
  "score": 0.0-1.0,
  "unsupported_claims": ["claim1", "claim2"] or [],
  "reasoning": "brief explanation"
}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```
Concept 08
8 Common RAG Failures and How to Fix Them
Symptom: Correct answer exists in your docs but LLM says "I don't have that information."
Fix: Check similarity scores. If top result scores below 0.7, your chunking is breaking up the relevant passage. Try larger chunks or sentence-aware chunking.
Symptom: Retrieved chunks contain the answer but LLM answers from training data.
Fix: Strengthen your system prompt: "You MUST only use the provided context. Never draw on training knowledge." Also reduce temperature to 0.
Symptom: Retrieved chunks are semantically close but don't contain complete answers.
Fix: Increase chunk size from 200 → 500 tokens. Add overlap. Consider parent-child chunking: embed small chunks but retrieve their parent document.
Symptom: You re-index with a different embedding model, and previously working queries now return irrelevant results.
Fix: Always embed queries with the same model used to embed documents. Store the model name in collection metadata and validate on startup.
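The startup validation suggested above can be sketched like this; `embedding_model` is a hypothetical metadata key you would set yourself at indexing time, not something ChromaDB provides:

```python
EXPECTED_EMBEDDING_MODEL = "text-embedding-3-small"


def validate_embedding_model(collection_metadata: dict) -> None:
    """
    Fail fast on startup if the collection was built with a different
    embedding model than the one used for queries.
    """
    stored = collection_metadata.get("embedding_model")  # hypothetical key, set at index time
    if stored is None:
        raise RuntimeError("Collection metadata has no 'embedding_model' key; re-index with it set.")
    if stored != EXPECTED_EMBEDDING_MODEL:
        raise RuntimeError(
            f"Index was built with {stored!r} but queries use "
            f"{EXPECTED_EMBEDDING_MODEL!r}; rebuild the index or switch query models."
        )
```

At indexing time you would pass the key alongside the existing metadata, e.g. `metadata={"hnsw:space": "cosine", "embedding_model": "text-embedding-3-small"}`, and call `validate_embedding_model(collection.metadata)` once on startup.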
Symptom: Documents have been updated but RAG still returns old information.
Fix: Implement incremental indexing — track document modification timestamps and re-embed changed files. Or rebuild the full index on a schedule.
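One way to sketch the timestamp tracking is a small manifest file of modification times; the manifest filename and the .txt-only scope are assumptions for the example:

```python
import json
from pathlib import Path


def files_needing_reindex(documents_dir: str, manifest_path: str) -> list[str]:
    """
    Compare each file's mtime against a stored manifest and return the
    files that are new or have changed since the last indexing run.
    """
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}

    changed = []
    current = {}
    for fp in sorted(Path(documents_dir).rglob("*.txt")):
        mtime = fp.stat().st_mtime
        current[str(fp)] = mtime
        if manifest.get(str(fp)) != mtime:
            changed.append(str(fp))

    manifest_file.write_text(json.dumps(current))  # save state for the next run
    return changed
```

Only the returned files need re-chunking and re-embedding; delete their old chunk IDs from the collection first so stale entries don't linger.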
Symptom: "What is the CEO's email?" fails because CEO name and email are in different chunks.
Fix: Query decomposition — break complex questions into sub-queries, retrieve for each, then synthesize.
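The decomposition step itself is an LLM call ("break this question into standalone sub-questions"); once each sub-query has been retrieved separately, synthesizing the results is plain bookkeeping. A sketch of that merge step, deduplicating by chunk text and keeping the best similarity seen (the chunk dict shape matches `retrieve_context` above):

```python
def merge_subquery_results(result_sets: list[list[dict]], top_k: int = 5) -> list[dict]:
    """
    Merge retrieved chunks from several sub-queries:
    deduplicate by text, keeping the highest similarity seen for each chunk.
    """
    best: dict[str, dict] = {}
    for chunks in result_sets:
        for chunk in chunks:
            prev = best.get(chunk["text"])
            if prev is None or chunk["similarity"] > prev["similarity"]:
                best[chunk["text"]] = chunk
    # Rank the deduplicated pool and keep the strongest matches
    merged = sorted(best.values(), key=lambda c: c["similarity"], reverse=True)
    return merged[:top_k]
```

The merged list then feeds `build_augmented_prompt` exactly as a single-query result would, so the generation step needs no changes.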
Symptom: Retrieval quality is sensitive to top_k: too low (k=1) misses relevant context; too high (k=20) fills the context window with noise.
Fix: Use k=5 as default. Add similarity threshold (0.65+) to filter low-relevance results. Monitor average similarity of retrieved chunks.
Symptom: Users can't verify answers; trust degrades over time.
Fix: Always include source attribution in your system prompt. Return source filenames in your API response. Let users click through to the original document.
- RAG = Retrieval + Augmented + Generation: retrieve relevant docs, inject as context, generate grounded answer
- Two phases: indexing (offline, one-time) and querying (real-time, per request)
- Chunking strategy is the #1 quality factor — use 300–500 tokens with 10–15% overlap
- Always include a minimum similarity threshold to avoid injecting irrelevant context
- Evaluate with faithfulness (answer supported by context?) and relevance (answer addresses question?)
- Most RAG failures are retrieval failures, not generation failures — fix the retrieval first