Imagine you're building a customer support chatbot for a SaaS product. The product has 500 help articles, a changelog that updates weekly, and pricing that changes quarterly. You ask GPT-4: "What's the refund policy?" It confidently gives you an answer — one that was true in 2023 but was updated six months ago.
This is the problem RAG solves. And it's why every serious AI product you've heard of — Notion AI, GitHub Copilot, Perplexity, every enterprise chatbot — is built on top of some version of this pattern.
Section 01
Why RAG Exists — 3 Problems It Solves
Large language models are trained on a snapshot of the internet up to a certain date. After that, they know nothing new. But that's just one of three problems:
- Stale knowledge. GPT-4's training data has a cutoff. It doesn't know about your product's latest features, your company's new policies, or anything that happened last month.
- Private data blindness. Your internal documents, customer data, and proprietary knowledge were never in the training set. The model literally cannot know what's in your Confluence wiki.
- Hallucination. When a model doesn't know something, it doesn't say "I don't know." It generates a plausible-sounding answer. This is catastrophic in production — a customer support bot that invents refund policies is a legal liability.
RAG fixes all three by giving the model a reference library to look things up in, at query time. Instead of relying on baked-in knowledge, the model reads the relevant documents and then answers. It's the difference between a doctor who studied medicine once and a doctor who looks up the latest guidelines before prescribing.
RAG doesn't make the model smarter. It makes the model better informed. The LLM's job shifts from "know the answer" to "read these documents and synthesize an answer." That's a much easier job — and one where hallucination is far less likely.
Section 02
RAG vs Fine-Tuning — Which Do You Need?
This is one of the most common interview questions you'll get. The answer is not "RAG is always better" — it's "it depends on what problem you're solving."
Think of it this way: fine-tuning teaches the model new skills or behaviors. RAG gives the model new knowledge to look up. A doctor who went to medical school (fine-tuning) still needs to look up the latest drug interaction database (RAG). The two are not mutually exclusive.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Best for | Knowledge that changes, private documents, facts | New style, tone, format, or reasoning patterns |
| Knowledge updates | Add a document, done. Minutes. | Retrain the model. Hours to days. |
| Cost | Vector DB + embedding API calls | GPU hours for training + serving costs |
| Transparency | Can cite which document the answer came from | Model "knows" but can't cite sources |
| Hallucination risk | Lower — answer is grounded in retrieved text | Higher — relies on weights, not documents |
| Data required | Just the documents themselves | Labelled (input, output) training examples |
| Use both when | Fine-tune for style/behavior + RAG for knowledge. E.g., a fine-tuned "PrepFlix assistant" persona that retrieves course-specific answers. | |
Teams often reach for fine-tuning when RAG would solve the problem in 1/10 the time and cost. Fine-tuning is the right answer only when you need the model to behave differently — new output format, domain-specific reasoning, or a persona. If you just need it to know new facts, use RAG.
Section 03
The Indexing Pipeline — Building the Knowledge Base
Before any user can ask a question, you need to prepare your documents. This is the indexing pipeline — it runs once upfront (and again whenever documents change). Think of it as building a library: you're organizing the books so they can be found quickly later.
- Load & parse. Extract clean text from PDFs, HTML, Markdown, Notion exports. Use libraries like `pypdf`, `unstructured`, or `markdownify`. Strip headers, footers, page numbers — anything that's noise, not signal.
- Chunk. Split the text into smaller segments. This is more art than science — we'll spend a full section on it. The key constraint: each chunk must fit inside the embedding model's token limit (usually 512 tokens) and carry enough context to stand alone.
- Embed. Pass each chunk through an embedding model. The model converts text to a dense vector — a list of ~768 floating-point numbers — that encodes the semantic meaning. Similar-meaning text → similar vectors.
- Store. Save each chunk with its vector and metadata (source file, page number, date, category) in a vector database. The metadata lets you filter later — "only search docs updated in the last 30 days."
- Index. The vector DB builds an HNSW (Hierarchical Navigable Small World) index over the stored vectors. This is what enables fast approximate nearest-neighbor search at query time — milliseconds instead of seconds.
In production, you don't re-index everything every time. Compute a hash (MD5 or SHA-256) of each document's content. If the hash hasn't changed since last run, skip it. Only embed and upsert documents that are new or changed. This keeps your indexing pipeline fast and cheap.
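The hash check can be sketched in a few lines. A minimal sketch, assuming documents arrive as a `doc_id → text` mapping and you persist a `doc_id → hash` table between indexing runs (both names are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(docs: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content changed.

    `docs` maps doc_id -> current text; `seen_hashes` maps doc_id -> hash
    recorded after the previous run. Unchanged docs are skipped entirely.
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen_hashes.get(doc_id) != h:
            changed.append(doc_id)
            seen_hashes[doc_id] = h  # record for the next run
    return changed
```

Only the IDs this returns need to be re-chunked, re-embedded, and upserted; everything else is a free skip.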
Section 04
Chunking — The Most Underrated Decision in RAG
Here's a dirty secret about RAG: the quality of your answers depends more on how you chunk your documents than on which LLM you use. Bad chunking + GPT-4 gives worse results than good chunking + GPT-4o-mini. Let me explain why.
When a user asks "What's the refund policy for annual subscriptions?", your retrieval system finds the chunks whose embeddings are closest to that question. If the relevant sentence is in the middle of a 2,000-word chunk, the entire chunk's embedding is diluted — the needle is hidden in a haystack of irrelevant words. The embedding won't be close enough to retrieve it. You've already failed before the LLM sees anything.
The Four Main Strategies
| Strategy | How It Works | Best For | Watch Out For |
|---|---|---|---|
| Fixed-size | Split every N tokens with M overlap. Simple, deterministic. | General docs, quick prototyping | May split mid-sentence, mid-thought |
| Recursive / semantic | Try to split at paragraph → sentence → word boundaries in priority order | Well-structured prose, articles | Uneven chunk sizes |
| Document-structure-aware | Split at headings, sections (Markdown `##`, HTML `<h2>`) | Structured docs: wikis, READMEs, legal contracts | Sections can be too long or too short |
| Hierarchical (parent-child) | Small child chunks for retrieval, large parent chunk injected into LLM | When precision + context both matter | More complex to implement |
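To make the document-structure-aware row concrete: for Markdown, a splitter can be as simple as a regex that breaks at `##` headings and keeps each heading attached to its body. A sketch, not a production parser:

```python
import re

def split_markdown_sections(md: str) -> list[str]:
    """Split a Markdown document at `##` headings.

    Uses a zero-width lookahead so each heading stays attached to the
    body that follows it. Anything before the first heading becomes its
    own chunk.
    """
    parts = re.split(r"(?m)^(?=## )", md)
    return [p.strip() for p in parts if p.strip()]
```

Each returned section is a natural chunk candidate — though, as the table warns, you'd still want to sub-split sections that blow past your token budget.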
Choosing Chunk Size — The 400-Token Sweet Spot
There's real tension here. Smaller chunks = more precise retrieval (the embedding focuses on one idea). Larger chunks = more context in each result (the LLM has more to work with). Here's how to think about it:
- Too small (under 100 tokens): Each chunk is a sentence or two. Embeddings are precise but chunks often lack context — the LLM gets "the policy is 30 days" without knowing which product or what the exceptions are.
- Too large (over 1000 tokens): Each chunk is several paragraphs. One chunk might cover the refund policy, return shipping, and exceptions all at once. The embedding is diluted. Retrieval misses more often.
- Sweet spot (300–500 tokens): Roughly one complete thought — a policy section, a code function, a FAQ answer. Dense enough to retrieve, rich enough to understand.
Always add 10–20% overlap between adjacent chunks. If your chunk size is 400 tokens, overlap by 40–80 tokens. This ensures that information spanning a chunk boundary (like a sentence that starts at the end of chunk 3 and continues into chunk 4) isn't lost.
The hierarchical pattern is worth knowing for interviews. You store two versions of each section: a small child chunk (100–150 tokens, precise for retrieval) and the full parent section (500–800 tokens, rich for the LLM). Retrieve using child chunks. Inject the parent into the prompt. Best of both worlds.
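A minimal sketch of that parent-child bookkeeping, using word counts instead of tokens for simplicity (the function name and sizes are illustrative):

```python
def build_parent_child_index(
    sections: list[str], child_size: int = 30
) -> tuple[list[dict], dict[int, str]]:
    """For each section (the parent), emit small child chunks that point
    back to it. Embed and retrieve on the children; at query time, follow
    parent_id and inject the full parent section into the prompt.
    """
    children: list[dict] = []
    parents: dict[int, str] = {}
    for pid, section in enumerate(sections):
        parents[pid] = section
        words = section.split()
        for i in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[i : i + child_size]),
                "parent_id": pid,  # the chunk you actually inject
            })
    return children, parents
```

Store the children (with `parent_id` in their metadata) in the vector DB and the parents in any plain key-value store — the parents never need embeddings.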
Section 05
The Query Pipeline — Answering in Real Time
When a user sends a question, the query pipeline runs. Unlike the indexing pipeline (which runs offline), this runs live — latency matters. Every step adds time, so you need to be deliberate about what to include.
Let's walk through the code for each step. I'll use Python with the OpenAI SDK — the patterns work identically with any provider.
```python
import openai

client = openai.OpenAI()

# ── Step 1: Embed the query ──────────────────────────────────
def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# ── Step 2: Retrieve from vector DB ─────────────────────────
# (Using a hypothetical vector_db client — same pattern for
#  Pinecone, Qdrant, Weaviate, Firestore, etc.)
def retrieve(query: str, top_k: int = 20) -> list[dict]:
    query_vector = embed(query)
    results = vector_db.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    return results  # list of {text, metadata, score}

# ── Step 3: Generate answer ──────────────────────────────────
def rag_answer(user_question: str) -> str:
    chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join(c["text"] for c in chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant.
Answer the user's question using ONLY the context below.
If the context doesn't contain the answer, say:
'I don't have information on that in my knowledge base.'
Never make up information.

Context:
""" + context
            },
            {"role": "user", "content": user_question}
        ]
    )
    return response.choices[0].message.content
```
If you index documents with text-embedding-3-small and query with text-embedding-3-large, you're comparing apples to oranges. The vector spaces are completely different. Retrieval will return garbage. This is a surprisingly common production bug.
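One cheap defense: record the embedding model (and its dimension) at index time and refuse to query with anything else. A hypothetical guard — the constant and function names are illustrative; 1536 is the output dimension of text-embedding-3-small:

```python
# Stored alongside the index when it was built (illustrative).
EXPECTED = {"model": "text-embedding-3-small", "dim": 1536}

def check_embedding_config(model: str, vector: list[float]) -> None:
    """Fail loudly if the query-time model doesn't match the index."""
    if model != EXPECTED["model"] or len(vector) != EXPECTED["dim"]:
        raise ValueError(
            f"Query embedded with {model} ({len(vector)}-d), but the index "
            f"was built with {EXPECTED['model']} ({EXPECTED['dim']}-d)"
        )
```

A dimension mismatch would fail at query time anyway, but a same-dimension model swap would silently return garbage — the explicit model-name check catches that case too.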
Section 06
Re-ranking — When "Close Enough" Isn't Good Enough
The ANN search uses a bi-encoder: it embeds the query and each document separately, then compares the vectors. This is fast — a single lookup against a pre-built index. But bi-encoders have a weakness: they embed the query and document in isolation, without any awareness of each other.
A cross-encoder takes the query and a candidate document together and scores them jointly. It can see the full interaction between the two texts, giving much more precise relevance scores. The trade-off: you can't pre-compute cross-encoder scores (they depend on the query), so you can only run them on a small candidate set.
This is why re-ranking uses a two-stage approach:
- Stage 1 — Bi-encoder retrieval: Fast ANN search, retrieve top-20 candidates. Miss rate ~5%. Cheap: one vector lookup.
- Stage 2 — Cross-encoder re-ranking: Score the top-20 candidates against the query. Re-order by precision. Return top-5. Expensive: 20 forward passes. But only 20, not millions.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Score each candidate against the query
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort by score descending, return top N
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [item for item, score in ranked[:top_n]]

# Updated query pipeline:
candidates = retrieve(user_question, top_k=20)   # bi-encoder, fast
top_chunks = rerank(user_question, candidates)   # cross-encoder, precise
```
Re-ranking adds ~50–200ms latency. Skip it in the prototype. Add it when you measure that retrieval precision is hurting answer quality (RAGAS context precision below 0.75). At scale, Cohere's Rerank API is a plug-and-play alternative to running your own cross-encoder.
Section 07
HyDE — When the Query Itself Is the Problem
Here's a subtle issue with standard RAG: queries and documents live in different "spaces" in the embedding model. A user asks a short, informal question: "how do i cancel?" Your document contains a formal paragraph: "Section 4.2: Cancellation Procedures — To terminate your subscription, navigate to Account Settings..."
These two texts look different to an embedding model, even though they're clearly related. The query embedding and document embedding aren't as close as they should be.
HyDE (Hypothetical Document Embeddings) is a clever fix. Instead of embedding the user's raw question, you first ask the LLM to generate a hypothetical answer — as if it already knew. Then you embed that hypothetical answer. A hypothetical answer uses the same vocabulary, style, and structure as real documents — so it lands much closer to the actual document in embedding space.
```python
def hyde_retrieve(user_question: str, top_k: int = 5) -> list[dict]:
    # Step 1: Generate a hypothetical answer
    hyp_response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.5,
        messages=[{
            "role": "user",
            "content": f"""Write a short passage that answers the following question,
as if it were from an official help document.

Question: {user_question}

Passage:"""
        }]
    )
    hypothetical_doc = hyp_response.choices[0].message.content

    # Step 2: Embed the hypothetical answer (not the question)
    hyp_vector = embed(hypothetical_doc)

    # Step 3: Search with the hypothetical vector
    results = vector_db.query(vector=hyp_vector, top_k=top_k)
    return results
```
Does it actually help? The original HyDE paper reported 10–20% improvements in retrieval recall on several benchmarks. The downside: you're adding an LLM call before retrieval, which costs roughly 1–2 seconds of extra latency plus the generation tokens. Use it selectively for short, ambiguous queries — not every query needs it.
Section 08
Conversation History — Making Multi-Turn RAG Work
Most RAG tutorials show you single-turn Q&A. But real products need conversations. And conversations break naive RAG immediately.
Here's why. The user asks: "What's the refund policy?" Your RAG retrieves the right document. Great. The user follows up: "Does that apply to annual plans too?" Your RAG embeds that question and searches — but "Does that apply to annual plans too?" has no context. What's "that"? The embedding model has no idea. Retrieval fails.
The fix is query rewriting: before embedding the user's latest message, use the LLM to rewrite it into a standalone question that includes all relevant context.
```python
def rewrite_query(conversation_history: list, latest_message: str) -> str:
    """Turn a follow-up question into a standalone search query."""
    history_text = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in conversation_history[-4:]  # last 4 turns
    ])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"""Given this conversation history:

{history_text}

Rewrite the latest message as a complete, standalone question
that can be understood without the conversation history.

Latest message: "{latest_message}"

Standalone question:"""
        }]
    )
    return response.choices[0].message.content

# Usage:
standalone_q = rewrite_query(history, "Does that apply to annual plans too?")
# → "Does the 30-day refund policy apply to annual subscription plans?"
chunks = retrieve(standalone_q)  # now retrieves correctly
```
Section 09
RAGAS — Measuring Whether Your RAG Actually Works
This is where most teams fail. They build the RAG system, try a few questions manually, think "looks good," and ship it. Then users complain that it gives wrong answers half the time.
Manual testing doesn't scale. You need automated metrics. RAGAS (Retrieval-Augmented Generation Assessment) is the standard framework. It gives you four key numbers:
| Metric | Question It Answers | What Low Score Means | Target |
|---|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | LLM is hallucinating beyond the context | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | Answer is vague or off-topic | > 0.80 |
| Context Precision | Are the retrieved chunks relevant to the question? | Retrieval is noisy — pulling irrelevant chunks | > 0.75 |
| Context Recall | Did retrieval find all the information needed to answer? | Chunking or retrieval is missing key information | > 0.75 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

# Build a golden test set: 50-100 questions with known correct answers
test_data = {
    "question": ["What is the refund policy?", "How do I cancel?"],
    "ground_truth": [
        "Refunds are available within 30 days of purchase.",
        "Cancel via Account Settings > Subscription > Cancel.",
    ],
    "contexts": [
        # The chunks your system retrieved for each question
        ["Our refund policy allows cancellation within 30 days..."],
        ["To cancel your subscription, go to Account Settings..."],
    ],
    "answer": [
        rag_answer("What is the refund policy?"),
        rag_answer("How do I cancel?"),
    ],
}

results = evaluate(
    Dataset.from_dict(test_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
# {'faithfulness': 0.91, 'answer_relevancy': 0.88,
#  'context_precision': 0.82, 'context_recall': 0.79}
```
Build a set of 50–100 (question, expected_answer) pairs. Run RAGAS on it before every deploy. If faithfulness drops by more than 5%, block the release. This is the AI equivalent of unit tests — it catches regressions before your users do.
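The deploy gate itself is tiny. A sketch, assuming you store the previous release's RAGAS scores as a baseline dict (names are illustrative):

```python
def gate_release(
    current: dict[str, float],
    baseline: dict[str, float],
    max_drop: float = 0.05,
) -> bool:
    """Return True only if every metric is within `max_drop` of its
    baseline value. Wire this into CI: a False blocks the deploy."""
    return all(current[m] >= baseline[m] - max_drop for m in baseline)
```

In CI this becomes one line: `sys.exit(0 if gate_release(scores, baseline) else 1)`.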
Section 10
The 7 RAG Failure Modes (And How to Fix Each)
A RAG system has many moving parts. Here are the seven most common ways it breaks in production — and the exact fix for each.
| # | Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|---|
| 1 | Wrong chunks retrieved | Answer is about the wrong topic | Chunk size too large, diluted embeddings | Smaller chunks + re-ranking |
| 2 | Relevant chunk not retrieved | Model says "I don't have info" but the doc exists | Low context recall — chunking split across boundary | Add overlap, hierarchical chunking, or HyDE |
| 3 | LLM ignores the context | Answer contradicts the retrieved chunk | Context buried in long prompt ("lost in the middle") | Put most relevant chunks first, limit to top-3 |
| 4 | Hallucination beyond context | Answer adds facts not in any chunk | Model's prior knowledge leaks in | Lower temperature, stricter system prompt: "ONLY use context" |
| 5 | Stale context | Answer is correct but outdated | Documents updated but index not refreshed | Event-driven re-indexing on document change |
| 6 | Follow-up questions fail | "Does that apply to X?" retrieves nothing useful | Pronoun/reference in query has no context for retrieval | Query rewriting with conversation history |
| 7 | Confident wrong answers | Model answers confidently when it shouldn't know | Retrieval score is low but LLM ignores it | Threshold retrieval score — below 0.7 cosine similarity, return "I don't know" |
Section 11
Full Working RAG — From Documents to Answers
Here's a complete, minimal RAG implementation in ~100 lines of Python. It uses ChromaDB (runs locally, no signup needed) so you can run this right now.
```python
# rag_complete.py
# pip install chromadb openai pypdf sentence-transformers
import openai
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# ── Setup ────────────────────────────────────────────────────
client = openai.OpenAI()
embed_fn = OpenAIEmbeddingFunction(
    api_key=client.api_key,
    model_name="text-embedding-3-small"
)
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge", embedding_function=embed_fn)

# ── Indexing ─────────────────────────────────────────────────
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    words = text.split()
    chunks = []
    step = size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + size])
        if chunk:
            chunks.append(chunk)
    return chunks

def index_document(doc_id: str, text: str, source: str) -> None:
    chunks = chunk_text(text)
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{"source": source, "chunk": i} for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks from {source}")

# ── Retrieval ────────────────────────────────────────────────
def retrieve(query: str, n_results: int = 5) -> list[str]:
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results["documents"][0]  # list of chunk texts

# ── Generation ───────────────────────────────────────────────
SYSTEM_PROMPT = """You are a helpful assistant.
Answer questions using ONLY the context provided below.
If the answer is not in the context, say:
"I don't have information on that in my knowledge base."
Do not make up any information.

Context:
{context}"""

def answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n---\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# ── Run it ───────────────────────────────────────────────────
if __name__ == "__main__":
    # Index some documents
    index_document("faq", open("faq.txt").read(), "FAQ")
    index_document("policy", open("refund_policy.txt").read(), "Refund Policy")

    # Ask questions
    questions = [
        "What is the refund policy?",
        "How do I cancel my subscription?",
        "Does the refund apply to annual plans?",
    ]
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {answer(q)}")
```
Create two text files — `faq.txt` and `refund_policy.txt` — with any content. Run `python rag_complete.py`. You now have a working RAG system. Then try: what happens if you ask about something not in the documents? You should see "I don't have information on that" — not a hallucinated answer. That's the whole point.
Section 12
What This Means for Your Interview
RAG is asked in virtually every AI engineer interview at product companies right now. Here's exactly what you need to be able to do:
Questions You Can Now Answer Cold
- "Explain how RAG works." → Walk through indexing pipeline + query pipeline. Take 90 seconds. Use the diagram in your head.
- "What's the difference between RAG and fine-tuning?" → Knowledge vs behavior. Freshness. Cost. Cite the table above.
- "How do you evaluate a RAG system?" → RAGAS: faithfulness, answer relevancy, context precision, context recall. Golden test set. Before every deploy.
- "How would you handle follow-up questions in a RAG chatbot?" → Query rewriting. Use the conversation history to make the latest message standalone before retrieval.
- "What's your chunking strategy?" → 300–500 token chunks, 10–20% overlap, structure-aware splitting. Hierarchical for precision + context.
The Design Question You'll Get
Interviewers love this one: "Design a RAG system for a 10K-person company that wants to search their internal documentation."
Your answer structure:
- Clarify: How often do docs change? Latency SLA? Scale (QPS)? Private data constraints?
- Indexing: Source connectors for Confluence/Notion/Drive → chunk → embed (text-embedding-3-small) → Qdrant with metadata (team, date, doc type)
- Retrieval: Embed query → ANN search (top-20) → cross-encoder re-rank (top-5) → inject into prompt
- Freshness: Webhook on document update → event queue → async re-index. Fingerprint unchanged docs to skip re-embedding.
- Evaluation: 100-question golden set + RAGAS in CI. Alert if faithfulness drops >5%.
- Scale: Semantic cache (Redis) for repeated queries. At 10K users, most questions cluster — cache hit rate 40–60%.
They want to see that you understand where things can go wrong. Don't just describe the happy path. Mention failure modes proactively: "One thing I'd watch for is the lost-in-the-middle problem — if the most relevant chunk is in position 8 of 10, the model may ignore it. I'd limit injection to the top 3 chunks to avoid this." That answer is what gets you the offer.