Imagine you're building a customer support chatbot for a SaaS product. The product has 500 help articles, a changelog that updates weekly, and pricing that changes quarterly. You ask GPT-4: "What's the refund policy?" It confidently gives you an answer — one that was true in 2023 but was updated six months ago.
This is the problem RAG solves. And it's why every serious AI product you've heard of — Notion AI, GitHub Copilot, Perplexity, every enterprise chatbot — is built on top of some version of this pattern.
Section 01
Why RAG Exists — 3 Problems It Solves
Large language models are trained on a snapshot of the internet up to a certain date. After that, they know nothing new. But that's just one of three problems:
- Stale knowledge. GPT-4's training data has a cutoff. It doesn't know about your product's latest features, your company's new policies, or anything that happened last month.
- Private data blindness. Your internal documents, customer data, and proprietary knowledge were never in the training set. The model literally cannot know what's in your Confluence wiki.
- Hallucination. When a model doesn't know something, it doesn't say "I don't know." It generates a plausible-sounding answer. This is catastrophic in production — a customer support bot that invents refund policies is a legal liability.
RAG fixes all three by giving the model a reference library to look things up in, at query time. Instead of relying on baked-in knowledge, the model reads the relevant documents and then answers. It's the difference between a doctor who studied medicine once and a doctor who looks up the latest guidelines before prescribing.
RAG doesn't make the model smarter. It makes the model better informed. The LLM's job shifts from "know the answer" to "read these documents and synthesize an answer." That's a much easier job — and one where hallucination is far less likely.
Section 02
RAG vs Fine-Tuning — Which Do You Need?
This is one of the most common interview questions you'll get. The answer is not "RAG is always better" — it's "it depends on what problem you're solving."
Think of it this way: fine-tuning teaches the model new skills or behaviors. RAG gives the model new knowledge to look up. A doctor who went to medical school (fine-tuning) still needs to look up the latest drug interaction database (RAG). The two are not mutually exclusive.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Best for | Knowledge that changes, private documents, facts | New style, tone, format, or reasoning patterns |
| Knowledge updates | Add a document, done. Minutes. | Retrain the model. Hours to days. |
| Cost | Vector DB + embedding API calls | GPU hours for training + serving costs |
| Transparency | Can cite which document the answer came from | Model "knows" but can't cite sources |
| Hallucination risk | Lower — answer is grounded in retrieved text | Higher — relies on weights, not documents |
| Data required | Just the documents themselves | Labelled (input, output) training examples |
| Use both when | Fine-tune for style/behavior + RAG for knowledge. E.g., a fine-tuned "PrepFlix assistant" persona that retrieves course-specific answers. | |
Teams often reach for fine-tuning when RAG would solve the problem in 1/10 the time and cost. Fine-tuning is the right answer only when you need the model to behave differently — new output format, domain-specific reasoning, or a persona. If you just need it to know new facts, use RAG.
Section 03
The Indexing Pipeline — Building the Knowledge Base
Before any user can ask a question, you need to prepare your documents. This is the indexing pipeline — it runs once upfront (and again whenever documents change). Think of it as building a library: you're organizing the books so they can be found quickly later.
- Load & parse. Extract clean text from PDFs, HTML, Markdown, Notion exports. Use libraries like `pypdf`, `unstructured`, or `markdownify`. Strip headers, footers, page numbers — anything that's noise, not signal.
- Chunk. Split the text into smaller segments. This is more art than science — we'll spend a full section on it. The key constraint: each chunk must fit inside the embedding model's token limit (usually 512 tokens) and carry enough context to stand alone.
- Embed. Pass each chunk through an embedding model. The model converts text to a dense vector — a list of ~768 floating-point numbers — that encodes the semantic meaning. Similar-meaning text → similar vectors.
- Store. Save each chunk with its vector and metadata (source file, page number, date, category) in a vector database. The metadata lets you filter later — "only search docs updated in the last 30 days."
- Index. The vector DB builds an HNSW (Hierarchical Navigable Small World) index over the stored vectors. This is what enables fast approximate nearest-neighbor search at query time — milliseconds instead of seconds.
In production, you don't re-index everything every time. Compute a hash (MD5 or SHA-256) of each document's content. If the hash hasn't changed since last run, skip it. Only embed and upsert documents that are new or changed. This keeps your indexing pipeline fast and cheap.
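The hash check can be sketched in a few lines. A minimal sketch, assuming documents arrive as a `doc_id → text` mapping and you persist a `doc_id → hash` table between indexing runs (both names are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(docs: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content changed.

    `docs` maps doc_id -> current text; `seen_hashes` maps doc_id -> hash
    recorded after the previous run. Unchanged docs are skipped entirely.
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen_hashes.get(doc_id) != h:
            changed.append(doc_id)
            seen_hashes[doc_id] = h  # record for the next run
    return changed
```

Only the IDs this returns need to be re-chunked, re-embedded, and upserted; everything else is a free skip.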
Section 04
Chunking — The Most Underrated Decision in RAG
Here's a dirty secret about RAG: the quality of your answers depends more on how you chunk your documents than on which LLM you use. Bad chunking + GPT-4 gives worse results than good chunking + GPT-4o-mini. Let me explain why.
When a user asks "What's the refund policy for annual subscriptions?", your retrieval system finds the chunks whose embeddings are closest to that question. If the relevant sentence is in the middle of a 2,000-word chunk, the entire chunk's embedding is diluted — the needle is hidden in a haystack of irrelevant words. The embedding won't be close enough to retrieve it. You've already failed before the LLM sees anything.
The Four Main Strategies
| Strategy | How It Works | Best For | Watch Out For |
|---|---|---|---|
| Fixed-size | Split every N tokens with M overlap. Simple, deterministic. | General docs, quick prototyping | May split mid-sentence, mid-thought |
| Recursive / semantic | Try to split at paragraph → sentence → word boundaries in priority order | Well-structured prose, articles | Uneven chunk sizes |
| Document-structure-aware | Split at headings, sections (Markdown `##`, HTML `<h2>`) | Structured docs: wikis, READMEs, legal contracts | Sections can be too long or too short |
| Hierarchical (parent-child) | Small child chunks for retrieval, large parent chunk injected into LLM | When precision + context both matter | More complex to implement |
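To make the document-structure-aware row concrete: for Markdown, a splitter can be as simple as a regex that breaks at `##` headings and keeps each heading attached to its body. A sketch, not a production parser:

```python
import re

def split_markdown_sections(md: str) -> list[str]:
    """Split a Markdown document at `##` headings.

    Uses a zero-width lookahead so each heading stays attached to the
    body that follows it. Anything before the first heading becomes its
    own chunk.
    """
    parts = re.split(r"(?m)^(?=## )", md)
    return [p.strip() for p in parts if p.strip()]
```

Each returned section is a natural chunk candidate — though, as the table warns, you'd still want to sub-split sections that blow past your token budget.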
Choosing Chunk Size — The 400-Token Sweet Spot
There's real tension here. Smaller chunks = more precise retrieval (the embedding focuses on one idea). Larger chunks = more context in each result (the LLM has more to work with). Here's how to think about it:
- Too small (under 100 tokens): Each chunk is a sentence or two. Embeddings are precise but chunks often lack context — the LLM gets "the policy is 30 days" without knowing which product or what the exceptions are.
- Too large (over 1000 tokens): Each chunk is several paragraphs. One chunk might cover the refund policy, return shipping, and exceptions all at once. The embedding is diluted. Retrieval misses more often.
- Sweet spot (300–500 tokens): Roughly one complete thought — a policy section, a code function, a FAQ answer. Dense enough to retrieve, rich enough to understand.
Always add 10–20% overlap between adjacent chunks. If your chunk size is 400 tokens, overlap by 40–80 tokens. This ensures that information spanning a chunk boundary (like a sentence that starts at the end of chunk 3 and continues into chunk 4) isn't lost.
The hierarchical pattern is worth knowing for interviews. You store two versions of each section: a small child chunk (100–150 tokens, precise for retrieval) and the full parent section (500–800 tokens, rich for the LLM). Retrieve using child chunks. Inject the parent into the prompt. Best of both worlds.
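A minimal sketch of that parent-child bookkeeping, using word counts instead of tokens for simplicity (the function name and sizes are illustrative):

```python
def build_parent_child_index(
    sections: list[str], child_size: int = 30
) -> tuple[list[dict], dict[int, str]]:
    """For each section (the parent), emit small child chunks that point
    back to it. Embed and retrieve on the children; at query time, follow
    parent_id and inject the full parent section into the prompt.
    """
    children: list[dict] = []
    parents: dict[int, str] = {}
    for pid, section in enumerate(sections):
        parents[pid] = section
        words = section.split()
        for i in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[i : i + child_size]),
                "parent_id": pid,  # the chunk you actually inject
            })
    return children, parents
```

Store the children (with `parent_id` in their metadata) in the vector DB and the parents in any plain key-value store — the parents never need embeddings.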
Section 05
The Query Pipeline — Answering in Real Time
When a user sends a question, the query pipeline runs. Unlike the indexing pipeline (which runs offline), this runs live — latency matters. Every step adds time, so you need to be deliberate about what to include.
Let's walk through the code for each step. I'll use Python with the OpenAI SDK — the patterns work identically with any provider.
```python
import openai

client = openai.OpenAI()

# ── Step 1: Embed the query ──────────────────────────────────
def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# ── Step 2: Retrieve from vector DB ─────────────────────────
# (Using a hypothetical vector_db client — same pattern for
#  Pinecone, Qdrant, Weaviate, Firestore, etc.)
def retrieve(query: str, top_k: int = 20) -> list[dict]:
    query_vector = embed(query)
    results = vector_db.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    return results  # list of {text, metadata, score}

# ── Step 3: Generate answer ──────────────────────────────────
def rag_answer(user_question: str) -> str:
    chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join(c["text"] for c in chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant.
Answer the user's question using ONLY the context below.
If the context doesn't contain the answer, say:
'I don't have information on that in my knowledge base.'
Never make up information.

Context:
""" + context
            },
            {"role": "user", "content": user_question}
        ]
    )
    return response.choices[0].message.content
```
If you index documents with text-embedding-3-small and query with text-embedding-3-large, you're comparing apples to oranges. The vector spaces are completely different. Retrieval will return garbage. This is a surprisingly common production bug.
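One cheap defense: record the embedding model (and its dimension) at index time and refuse to query with anything else. A hypothetical guard — the constant and function names are illustrative; 1536 is the output dimension of text-embedding-3-small:

```python
# Stored alongside the index when it was built (illustrative).
EXPECTED = {"model": "text-embedding-3-small", "dim": 1536}

def check_embedding_config(model: str, vector: list[float]) -> None:
    """Fail loudly if the query-time model doesn't match the index."""
    if model != EXPECTED["model"] or len(vector) != EXPECTED["dim"]:
        raise ValueError(
            f"Query embedded with {model} ({len(vector)}-d), but the index "
            f"was built with {EXPECTED['model']} ({EXPECTED['dim']}-d)"
        )
```

A dimension mismatch would fail at query time anyway, but a same-dimension model swap would silently return garbage — the explicit model-name check catches that case too.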
Section 06
Re-ranking — When "Close Enough" Isn't Good Enough
The ANN search uses a bi-encoder: it embeds the query and each document separately, then compares the vectors. This is fast — a single lookup against a pre-built index. But bi-encoders have a weakness: they embed the query and document in isolation, without any awareness of each other.
A cross-encoder takes the query and a candidate document together and scores them jointly. It can see the full interaction between the two texts, giving much more precise relevance scores. The trade-off: you can't pre-compute cross-encoder scores (they depend on the query), so you can only run them on a small candidate set.
This is why re-ranking uses a two-stage approach:
- Stage 1 — Bi-encoder retrieval: Fast ANN search, retrieve top-20 candidates. Miss rate ~5%. Cheap: one vector lookup.
- Stage 2 — Cross-encoder re-ranking: Score the top-20 candidates against the query. Re-order by precision. Return top-5. Expensive: 20 forward passes. But only 20, not millions.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Score each candidate against the query
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort by score descending, return top N
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [item for item, score in ranked[:top_n]]

# Updated query pipeline:
candidates = retrieve(user_question, top_k=20)   # bi-encoder, fast
top_chunks = rerank(user_question, candidates)   # cross-encoder, precise
```
Re-ranking adds ~50–200ms latency. Skip it in the prototype. Add it when you measure that retrieval precision is hurting answer quality (RAGAS context precision below 0.75). At scale, Cohere's Rerank API is a plug-and-play alternative to running your own cross-encoder.
Section 07
HyDE — When the Query Itself Is the Problem
Here's a subtle issue with standard RAG: queries and documents live in different "spaces" in the embedding model. A user asks a short, informal question: "how do i cancel?" Your document contains a formal paragraph: "Section 4.2: Cancellation Procedures — To terminate your subscription, navigate to Account Settings..."
These two texts look different to an embedding model, even though they're clearly related. The query embedding and document embedding aren't as close as they should be.
HyDE (Hypothetical Document Embeddings) is a clever fix. Instead of embedding the user's raw question, you first ask the LLM to generate a hypothetical answer — as if it already knew. Then you embed that hypothetical answer. A hypothetical answer uses the same vocabulary, style, and structure as real documents — so it lands much closer to the actual document in embedding space.
```python
def hyde_retrieve(user_question: str, top_k: int = 5) -> list[dict]:
    # Step 1: Generate a hypothetical answer
    hyp_response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.5,
        messages=[{
            "role": "user",
            "content": f"""Write a short passage that answers the following question,
as if it were from an official help document.

Question: {user_question}

Passage:"""
        }]
    )
    hypothetical_doc = hyp_response.choices[0].message.content

    # Step 2: Embed the hypothetical answer (not the question)
    hyp_vector = embed(hypothetical_doc)

    # Step 3: Search with the hypothetical vector
    results = vector_db.query(vector=hyp_vector, top_k=top_k)
    return results
```
Does it actually help? The original HyDE paper reported 10–20% improvements in retrieval recall on several benchmarks. The downside: you're adding an LLM call before retrieval, which costs roughly 1–2 seconds of extra latency plus the generation tokens. Use it selectively for short, ambiguous queries — not every query needs it.
Section 08
Conversation History — Making Multi-Turn RAG Work
Most RAG tutorials show you single-turn Q&A. But real products need conversations. And conversations break naive RAG immediately.
Here's why. The user asks: "What's the refund policy?" Your RAG retrieves the right document. Great. The user follows up: "Does that apply to annual plans too?" Your RAG embeds that question and searches — but "Does that apply to annual plans too?" has no context. What's "that"? The embedding model has no idea. Retrieval fails.
The fix is query rewriting: before embedding the user's latest message, use the LLM to rewrite it into a standalone question that includes all relevant context.
```python
def rewrite_query(conversation_history: list, latest_message: str) -> str:
    """Turn a follow-up question into a standalone search query."""
    history_text = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in conversation_history[-4:]  # last 4 turns
    ])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"""Given this conversation history:

{history_text}

Rewrite the latest message as a complete, standalone question
that can be understood without the conversation history.

Latest message: "{latest_message}"

Standalone question:"""
        }]
    )
    return response.choices[0].message.content

# Usage:
standalone_q = rewrite_query(history, "Does that apply to annual plans too?")
# → "Does the 30-day refund policy apply to annual subscription plans?"
chunks = retrieve(standalone_q)  # now retrieves correctly
```
Section 09
RAGAS — Measuring Whether Your RAG Actually Works
This is where most teams fail. They build the RAG system, try a few questions manually, think "looks good," and ship it. Then users complain that it gives wrong answers half the time.
Manual testing doesn't scale. You need automated metrics. RAGAS (Retrieval-Augmented Generation Assessment) is the standard framework. It gives you four key numbers:
| Metric | Question It Answers | What Low Score Means | Target |
|---|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | LLM is hallucinating beyond the context | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | Answer is vague or off-topic | > 0.80 |
| Context Precision | Are the retrieved chunks relevant to the question? | Retrieval is noisy — pulling irrelevant chunks | > 0.75 |
| Context Recall | Did retrieval find all the information needed to answer? | Chunking or retrieval is missing key information | > 0.75 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

# Build a golden test set: 50-100 questions with known correct answers
test_data = {
    "question": ["What is the refund policy?", "How do I cancel?"],
    "ground_truth": [
        "Refunds are available within 30 days of purchase.",
        "Cancel via Account Settings > Subscription > Cancel.",
    ],
    "contexts": [
        # The chunks your system retrieved for each question
        ["Our refund policy allows cancellation within 30 days..."],
        ["To cancel your subscription, go to Account Settings..."],
    ],
    "answer": [
        rag_answer("What is the refund policy?"),
        rag_answer("How do I cancel?"),
    ],
}

results = evaluate(
    Dataset.from_dict(test_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
# {'faithfulness': 0.91, 'answer_relevancy': 0.88,
#  'context_precision': 0.82, 'context_recall': 0.79}
```
Build a set of 50–100 (question, expected_answer) pairs. Run RAGAS on it before every deploy. If faithfulness drops by more than 5%, block the release. This is the AI equivalent of unit tests — it catches regressions before your users do.
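The deploy gate itself is tiny. A sketch, assuming you store the previous release's RAGAS scores as a baseline dict (names are illustrative):

```python
def gate_release(
    current: dict[str, float],
    baseline: dict[str, float],
    max_drop: float = 0.05,
) -> bool:
    """Return True only if every metric is within `max_drop` of its
    baseline value. Wire this into CI: a False blocks the deploy."""
    return all(current[m] >= baseline[m] - max_drop for m in baseline)
```

In CI this becomes one line: `sys.exit(0 if gate_release(scores, baseline) else 1)`.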
Section 10
The 7 RAG Failure Modes (And How to Fix Each)
A RAG system has many moving parts. Here are the seven most common ways it breaks in production — and the exact fix for each.
| # | Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|---|
| 1 | Wrong chunks retrieved | Answer is about the wrong topic | Chunk size too large, diluted embeddings | Smaller chunks + re-ranking |
| 2 | Relevant chunk not retrieved | Model says "I don't have info" but the doc exists | Low context recall — chunking split across boundary | Add overlap, hierarchical chunking, or HyDE |
| 3 | LLM ignores the context | Answer contradicts the retrieved chunk | Context buried in long prompt ("lost in the middle") | Put most relevant chunks first, limit to top-3 |
| 4 | Hallucination beyond context | Answer adds facts not in any chunk | Model's prior knowledge leaks in | Lower temperature, stricter system prompt: "ONLY use context" |
| 5 | Stale context | Answer is correct but outdated | Documents updated but index not refreshed | Event-driven re-indexing on document change |
| 6 | Follow-up questions fail | "Does that apply to X?" retrieves nothing useful | Pronoun/reference in query has no context for retrieval | Query rewriting with conversation history |
| 7 | Confident wrong answers | Model answers confidently when it shouldn't know | Retrieval score is low but LLM ignores it | Threshold retrieval score — below 0.7 cosine similarity, return "I don't know" |
Section 11
Full Working RAG — From Documents to Answers
Here's a complete, minimal RAG implementation in ~100 lines of Python. It uses ChromaDB (runs locally, no signup needed) so you can run this right now.
```python
# rag_complete.py
# pip install chromadb openai pypdf sentence-transformers
import openai
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# ── Setup ────────────────────────────────────────────────────
client = openai.OpenAI()
embed_fn = OpenAIEmbeddingFunction(
    api_key=client.api_key,
    model_name="text-embedding-3-small"
)
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge", embedding_function=embed_fn)

# ── Indexing ─────────────────────────────────────────────────
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    words = text.split()
    chunks = []
    step = size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + size])
        if chunk:
            chunks.append(chunk)
    return chunks

def index_document(doc_id: str, text: str, source: str) -> None:
    chunks = chunk_text(text)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{"source": source, "chunk": i} for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks from {source}")

# ── Retrieval ────────────────────────────────────────────────
def retrieve(query: str, n_results: int = 5) -> list[str]:
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results["documents"][0]  # list of chunk texts

# ── Generation ───────────────────────────────────────────────
SYSTEM_PROMPT = """You are a helpful assistant.
Answer questions using ONLY the context provided below.
If the answer is not in the context, say:
"I don't have information on that in my knowledge base."
Do not make up any information.

Context:
{context}"""

def answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n---\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# ── Run it ───────────────────────────────────────────────────
if __name__ == "__main__":
    # Index some documents
    index_document("faq", open("faq.txt").read(), "FAQ")
    index_document("policy", open("refund_policy.txt").read(), "Refund Policy")

    # Ask questions
    questions = [
        "What is the refund policy?",
        "How do I cancel my subscription?",
        "Does the refund apply to annual plans?",
    ]
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {answer(q)}")
```
Create two text files — `faq.txt` and `refund_policy.txt` — with any content. Run `python rag_complete.py`. You now have a working RAG system. Then try: what happens if you ask about something not in the documents? You should see "I don't have information on that" — not a hallucinated answer. That's the whole point.
Section 12
What This Means for Your Interview
RAG is asked in virtually every AI engineer interview at product companies right now. Here's exactly what you need to be able to do:
Questions You Can Now Answer Cold
- "Explain how RAG works." → Walk through indexing pipeline + query pipeline. Take 90 seconds. Use the diagram in your head.
- "What's the difference between RAG and fine-tuning?" → Knowledge vs behavior. Freshness. Cost. Cite the table above.
- "How do you evaluate a RAG system?" → RAGAS: faithfulness, answer relevancy, context precision, context recall. Golden test set. Before every deploy.
- "How would you handle follow-up questions in a RAG chatbot?" → Query rewriting. Use the conversation history to make the latest message standalone before retrieval.
- "What's your chunking strategy?" → 300–500 token chunks, 10–20% overlap, structure-aware splitting. Hierarchical for precision + context.
The Design Question You'll Get
Interviewers love this one: "Design a RAG system for a 10K-person company that wants to search their internal documentation."
Your answer structure:
- Clarify: How often do docs change? Latency SLA? Scale (QPS)? Private data constraints?
- Indexing: Source connectors for Confluence/Notion/Drive → chunk → embed (text-embedding-3-small) → Qdrant with metadata (team, date, doc type)
- Retrieval: Embed query → ANN search (top-20) → cross-encoder re-rank (top-5) → inject into prompt
- Freshness: Webhook on document update → event queue → async re-index. Fingerprint unchanged docs to skip re-embedding.
- Evaluation: 100-question golden set + RAGAS in CI. Alert if faithfulness drops >5%.
- Scale: Semantic cache (Redis) for repeated queries. At 10K users, most questions cluster — cache hit rate 40–60%.
They want to see that you understand where things can go wrong. Don't just describe the happy path. Mention failure modes proactively: "One thing I'd watch for is the lost-in-the-middle problem — if the most relevant chunk is in position 8 of 10, the model may ignore it. I'd limit injection to the top 3 chunks to avoid this." That answer is what gets you the offer.