Phase 3 — Applied AI

Retrieval-Augmented Generation (RAG)

The most important topic for SDEs building AI products in 2025. Know every part of this pipeline. It's asked in almost every AI engineer interview.

Why RAG Exists — 3 Problems It Solves CRITICAL
Problem 1: Knowledge Cutoff
LLMs have a training-data cutoff. The original GPT-4's data, for example, ends in 2021; even the newest models trail the present by months. They can't answer about events after their cutoff — or about your real-time data.
Problem 2: Hallucination
LLMs generate confident but wrong answers when they don't know something. There's no built-in "I don't know" — the model guesses fluently.
Problem 3: Private/Internal Data
LLMs were trained on public internet data. They have zero knowledge of your company's internal docs, codebase, or proprietary databases.
RAG Solution:

At query time, retrieve the relevant documents from your own database and inject them into the LLM's context as grounding material.

The LLM now answers from the retrieved facts, not from its (potentially stale or absent) memory.

Benefits over fine-tuning for knowledge:
✓ Always current — update the index, not the model
✓ Auditable — you can see exactly what was retrieved
✓ Access control — filter by user permissions
✓ No training cost — works with any LLM API
✓ Citable — show users the source documents
The Indexing Pipeline (runs offline, once or periodically) CRITICAL
1. Load Documents
Ingest from all sources: PDFs, web pages, databases, Notion, Confluence, Slack. Use connectors (LangChain DocumentLoaders or custom scripts). Handle different formats: text extraction, OCR for scanned PDFs, table parsing.
2. Chunk Documents
Split each document into smaller pieces. Size and strategy matter enormously (see the Chunking Strategies section below). Store chunk metadata: source URL, document title, author, date, section.
3. Generate Embeddings
Call your embedding model API (Gemini text-embedding-004, OpenAI, etc.) for each chunk. Produces a 768-3072-dimension float vector per chunk. Batch API calls to stay within rate limits.
4. Store in Vector Database
Store { text: chunk_content, embedding: [0.02, -0.44, …], metadata: {source, date, author, category} }. The vector DB builds the HNSW index for fast ANN search. Incremental updates: detect changed docs, re-embed only those chunks.
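A minimal sketch of steps 1-4 end to end. The `embed` function here is a toy hash-based stand-in for a real embedding API (such as text-embedding-004), and a plain Python list stands in for a vector DB; all names are illustrative, not a real library's API:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding API: hash each word into one
    # of `dim` buckets, then L2-normalize the resulting count vector.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(text: str, size: int = 40, overlap: int = 8) -> list[str]:
    # Step 2: fixed-size chunking by word count, with overlap so that
    # content at a chunk boundary appears whole in at least one chunk.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def build_index(documents: list[dict]) -> list[dict]:
    # Steps 1-4: load -> chunk -> embed -> store each record with its
    # metadata (a real vector DB would also build an HNSW index here).
    index = []
    for doc in documents:
        for piece in chunk(doc["text"]):
            index.append({
                "text": piece,
                "embedding": embed(piece),
                "metadata": {"source": doc["source"], "date": doc["date"]},
            })
    return index
```

Swapping `embed` for a real API call and the list for a vector DB client leaves the shape of the pipeline unchanged.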
The Query Pipeline (runs real-time, per user request) CRITICAL
1. Receive User Question
If the question is a follow-up (e.g., "What about its performance?"), it lacks context. Rewrite it to include conversation history BEFORE embedding: "What is the performance of the sliding window algorithm?" See the Conversation History section.
2. Embed the Question
Use the SAME embedding model used during indexing. Mismatch in embedding model = misaligned vector spaces = bad retrieval. This is a common production bug.
3. Vector Search → Top-k Chunks
ANN search retrieves the top 20 most similar chunks. Apply metadata filters first if needed (date range, category, user permissions). Consider hybrid search (BM25 + vector) for better recall.
4. Re-rank (optional but recommended)
A cross-encoder re-ranks the top 20 candidates by jointly encoding each (query + chunk) pair. More accurate than a bi-encoder, and it only runs on 20 candidates, not millions. Pick the top 5 for context.
5. Assemble Prompt
[System instructions] + [top-5 retrieved chunks] + [last 3 turns of conversation history] + [user question]. Watch for context length limits — don't stuff too many chunks.
6. Generate Answer
The LLM generates an answer grounded in the retrieved chunks. The system prompt must say: "Answer ONLY using the provided context. Never make up information. Cite which document each claim comes from."
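The query-side steps 2-5 can be sketched as follows. The toy hash-based `embed` stands in for the real embedding model (in production it must be the same model used at indexing time), and exhaustive cosine search stands in for ANN; the function names are illustrative:

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy hash-based embedding; production code MUST use the same
    # model that built the index (step 2's warning).
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.strip(".,?!").encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(index: list[dict], query_vec: list[float], k: int = 5) -> list[dict]:
    # Step 3: exhaustive cosine search (a vector DB would use ANN/HNSW).
    return sorted(index,
                  key=lambda c: sum(a * b for a, b in zip(c["embedding"], query_vec)),
                  reverse=True)[:k]

def assemble_prompt(chunks: list[dict], question: str) -> str:
    # Step 5: system instructions + retrieved context + user question.
    context = "\n\n".join(c["text"] for c in chunks)
    return ("Answer ONLY using the provided context. Never make up "
            f"information.\n\nContext:\n{context}\n\nQuestion: {question}")

index = [{"text": t, "embedding": embed(t)} for t in [
    "Binary search runs in O(log n) time on a sorted array.",
    "Quicksort has O(n log n) average-case time complexity.",
    "HNSW is a graph index for approximate nearest neighbor search.",
]]
question = "What is the time complexity of binary search?"
prompt = assemble_prompt(top_k(index, embed(question), k=2), question)
```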
Chunking Strategies CRITICAL

How you split documents before embedding is one of the most impactful decisions in your RAG pipeline. Poor chunking → irrelevant retrieval → bad answers.

Strategy | How It Works | Best For | Tradeoff
Fixed-size | Split every N tokens/chars, with M overlap | Simple baseline, good starting point | Can split mid-sentence, mid-concept
Recursive character | Try paragraph → sentence → word boundaries in order | General purpose (LangChain's default); works well for most text | Variable chunk sizes, slightly more complex
Sentence-based | Split on sentence boundaries (periods, newlines) | Cleaner semantics; good for structured text | Variable sizes; very short sentences lose context
Semantic | Embed sentences, split where cosine similarity drops (topic change) | Highest quality, topic-coherent chunks | Expensive (embedding call per sentence), slowest
Hierarchical (parent-child) | Small chunks (child) for precise retrieval; return the large parent chunk to the LLM | Best precision + context; legal docs, research papers | More complex to implement; must store parent-child links
Rules of Thumb
Too small (< 100 tokens): chunk misses context → bad answers even with good retrieval
Too large (> 1000 tokens): noise dilutes the relevant part → LLM pays less attention to key info
Sweet spot: 256-512 tokens with 10-20% overlap
Overlap purpose: prevents losing context at chunk boundaries
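The overlap rule can be seen in a short sketch of fixed-size chunking by word count (the 6-token size and 2-token overlap are only for illustration; production sizes are in the sweet-spot range above):

```python
def chunk_with_overlap(tokens: list[str], size: int = 6, overlap: int = 2) -> list[list[str]]:
    # Each chunk starts `size - overlap` tokens after the previous one,
    # so every chunk boundary is covered by two chunks.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = "the sliding window algorithm keeps a running sum over the array".split()
chunks = chunk_with_overlap(tokens)
# The last `overlap` tokens of each chunk reappear at the start of the
# next one, so a phrase cut at a boundary survives whole in one chunk.
```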
Interview Q: "Same chunk size for a 200-page legal PDF and a product catalog?"

Legal PDF: larger chunks (512-1024 tokens). A legal clause loses meaning when split mid-paragraph. Use recursive splitting with semantic awareness.

Product catalog: smaller chunks (128-256 tokens). Each product entry is an atomic self-contained fact. Minimal overlap needed.
Retrieval Improvements HIGH

Re-ranking with Cross-Encoder

Problem: Bi-encoder (embeddings) retrieves 20 candidates quickly but isn't precise enough.

Solution: Cross-encoder reads both query and chunk together, producing a much more accurate relevance score.

  1. Fast ANN search: retrieve top-20 candidates (bi-encoder)
  2. Cross-encoder scores each (query, candidate) pair
  3. Re-sort by cross-encoder score, take top-5
Cross-encoder is slower but runs only on 20 results, not millions. The latency cost is small; the quality gain is large.
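The two-stage retrieve-then-rerank flow can be sketched with toy stand-ins: a hash-based bi-encoder that embeds query and chunk independently, and a term-overlap scorer playing the cross-encoder's role (a real system would run a model such as a fine-tuned MiniLM cross-encoder on each pair). All names here are illustrative:

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    # Bi-encoder stand-in: each text is encoded INDEPENDENTLY.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_score(query: str, chunk: str) -> float:
    # Cross-encoder stand-in: scores query and chunk TOGETHER.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def retrieve_and_rerank(query: str, index: list[dict],
                        k_retrieve: int = 20, k_final: int = 5) -> list[dict]:
    qvec = embed(query)
    # 1. fast first stage: top-k_retrieve candidates by dot product
    candidates = sorted(index,
                        key=lambda c: sum(a * b for a, b in zip(qvec, c["embedding"])),
                        reverse=True)[:k_retrieve]
    # 2-3. slow second stage: joint scoring on the few survivors only
    candidates.sort(key=lambda c: cross_score(query, c["text"]), reverse=True)
    return candidates[:k_final]
```

The expensive scorer never sees more than `k_retrieve` chunks, which is why the latency cost stays small.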

HyDE (Hypothetical Document Embeddings)

Problem: Short user queries and long document chunks are in different semantic spaces — mismatch hurts retrieval.

Solution:

  1. Use LLM to generate a hypothetical answer to the query
  2. Embed the hypothetical answer (not the query)
  3. Search with that embedding — better match to document style
HyDE adds one LLM call latency. Worth it for knowledge-dense domains where query-document style mismatch is significant (legal, medical).
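The HyDE steps can be sketched with a stubbed `llm` callable (a real system would make an actual LLM call) and the same toy hash embedding used throughout these notes as a stand-in for a real model:

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy hash-based embedding standing in for a real model.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def hyde_search(query: str, index: list[dict], llm) -> dict:
    # 1. generate a hypothetical answer (the one extra LLM call)
    hypothetical = llm(f"Write a short passage that answers: {query}")
    # 2-3. embed the ANSWER, not the query, and search with that vector;
    # the hypothetical text matches document style better than the query
    hvec = embed(hypothetical)
    return max(index, key=lambda c: sum(a * b for a, b in zip(hvec, c["embedding"])))
```

Note the short conversational query shares few words with document-style chunks, while the hypothetical passage shares many — which is exactly the mismatch HyDE bridges.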

Metadata Filtering

Filter by date, category, author, or user permissions before vector search. Critical for enterprise RAG where access control matters. "Only search documents from Q4 2024" → filter by date metadata.
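A sketch of filter-then-search, assuming the metadata fields (`group`, `date`) were stored at indexing time; the field names and filter shape are illustrative:

```python
def filtered_search(index: list[dict], query_vec: list[float],
                    user_groups: set[str], date_from: str, k: int = 5) -> list[dict]:
    # Apply metadata filters BEFORE similarity scoring, so access
    # control can never be bypassed by a well-phrased query.
    allowed = [
        c for c in index
        if c["metadata"]["group"] in user_groups
        and c["metadata"]["date"] >= date_from  # ISO dates compare as strings
    ]
    allowed.sort(key=lambda c: sum(a * b for a, b in zip(query_vec, c["embedding"])),
                 reverse=True)
    return allowed[:k]
```

Real vector DBs push these filters into the index itself; the ordering (filter first, then score) is the important part.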

Conversation History — The Follow-Up Question Problem HIGH
The problem: User asks "What is binary search?" then follows with "What's its time complexity?"
If you embed "What's its time complexity?" as-is and search your vector DB — you'll retrieve irrelevant chunks, because the query has no context about binary search.

The fix: Before embedding, rewrite the user's question to be standalone using conversation history.
# Query rewriting with conversation history
def rewrite_query(conversation_history, current_question, llm):
    history_text = "\n".join(
        f"{m['role'].capitalize()}: {m['content']}"
        for m in conversation_history[-3:]  # last 3 turns
    )
    prompt = f"""Given the conversation history below, rewrite the follow-up question to be a standalone question that contains all necessary context.

Conversation:
{history_text}

Follow-up question: {current_question}

Standalone question:"""
    return llm(prompt)

# "What's its time complexity?"
# → "What is the time complexity of binary search?"
RAG Evaluation with RAGAS HIGH
Metric | What It Measures | Bad Score Means | Target
Faithfulness | Is every claim in the answer supported by the retrieved context? | LLM is hallucinating, going beyond what was in the retrieved chunks | > 0.85
Answer Relevance | Does the answer actually address the question asked? | Answer is grounded but off-topic or incomplete | > 0.80
Context Precision | Of the retrieved chunks, what fraction are actually relevant to the question? | Noisy retrieval: irrelevant chunks confuse the LLM | > 0.75
Context Recall | Do the retrieved chunks cover all aspects needed to fully answer the question? | Missing information: relevant chunks weren't retrieved | > 0.75
Process: Build a ground-truth dataset of 50-100 (question, expected_answer) pairs. Measure RAGAS baseline. Make a pipeline change (chunk size, embedding model, re-ranking). Re-measure. Only deploy changes that improve scores. This is how production RAG is iterated.
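The deploy gate in that process can be sketched as a simple comparison of metric dicts. The scores here are assumed to come from RAGAS runs over the same ground-truth set; the function name, tolerance, and numbers are all illustrative:

```python
def should_deploy(baseline: dict, candidate: dict, tolerance: float = 0.01) -> bool:
    # Ship a pipeline change only if no metric regresses beyond
    # `tolerance` AND at least one metric improves.
    no_regression = all(candidate[m] >= baseline[m] - tolerance for m in baseline)
    improved = any(candidate[m] > baseline[m] for m in baseline)
    return no_regression and improved

baseline = {"faithfulness": 0.86, "answer_relevance": 0.81,
            "context_precision": 0.74, "context_recall": 0.70}
with_reranking = {"faithfulness": 0.87, "answer_relevance": 0.81,
                  "context_precision": 0.83, "context_recall": 0.71}
```

Gating on "no metric regresses" matters because a change like aggressive re-ranking can raise Context Precision while quietly hurting Context Recall.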
Common RAG Failure Modes & Fixes CRITICAL
Failure Mode | Symptom | Fix
Bad chunking | Answers miss part of the information despite it being in the document | Use recursive character splitting; increase overlap; try hierarchical chunking
Wrong embedding model | Retrieves topically unrelated chunks; low Context Precision | Upgrade to a better embedding model; MUST use the same model for indexing and querying
Context stuffing | Relevant chunk is retrieved but buried in noise; LLM ignores it | Re-rank before passing to the LLM; reduce from top-20 to top-5; put the most relevant chunk first
No metadata | Can't filter by recency or access level; old documents contaminate results | Always store date, author, category, and user permissions as metadata at indexing time
Query-document mismatch | Short queries don't match long document embeddings; low Context Recall | Use HyDE (generate a hypothetical answer, embed that), or asymmetric models (separate query/doc encoders)
Missing conversation context | Follow-up questions retrieve irrelevant chunks | Rewrite follow-ups to be standalone before embedding, using the last 3 conversation turns
Stale index | Answers based on outdated documents | Set up incremental indexing: detect document changes (webhook, polling, or version tracking), re-embed only changed chunks
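The stale-index fix via version tracking can be sketched with content hashing; the actual delete-and-re-embed step against the vector DB is elided to a comment, and the function name is illustrative:

```python
import hashlib

def changed_docs(docs: list[dict], index_state: dict) -> list[str]:
    # Incremental indexing: hash each document's content; a doc needs
    # re-chunking and re-embedding only when its hash differs from the
    # one recorded at the last indexing run.
    stale = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if index_state.get(doc["id"]) != digest:
            index_state[doc["id"]] = digest
            stale.append(doc["id"])  # delete old chunks, re-embed this doc
    return stale
```

Running this on every sync pass (or on a webhook) keeps embedding costs proportional to what actually changed, not to corpus size.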