Concept 01

Why RAG? The 3 Problems It Solves

Without RAG, using LLMs in your application means accepting three serious problems:

Problem 1: Hallucination. LLMs confidently generate plausible-sounding text that's factually wrong. Ask GPT-4 about your company's refund policy and it will invent one that sounds reasonable. The model has no idea what your policy actually says — it just pattern-matches to what refund policies typically say.

Problem 2: Private data blindness. LLMs are trained on public internet data. They know nothing about your internal documentation, customer conversations, product manuals, or proprietary database contents. Fine-tuning helps but is expensive, slow to iterate, and still can't keep the model current with today's data.

Problem 3: Context window limits. Even with a 200k token context window, you can't stuff your entire company knowledge base into every API call. It's too expensive, too slow, and the model degrades in quality when the context is full of irrelevant information.

RAG solves all three: it retrieves only the relevant documents for each query and injects them as context. The LLM is now grounded in real, current, private data — and the context window is filled with relevant information, not noise.

Concept 02

RAG Architecture — Two Phases

RAG SYSTEM ARCHITECTURE
═══════════════════════════════════════════════════════

PHASE 1: INDEXING (runs once, offline)
──────────────────────────────────────
Documents (PDFs, HTML, text)
        │
        ▼
[LOAD]  → Load raw text from files/databases/APIs
        │
        ▼
[CHUNK] → Split into ~500 token overlapping chunks
        │
        ▼
[EMBED] → Convert each chunk to a 1536-dim vector
        │
        ▼
[STORE] → Save vectors + text to ChromaDB/Pinecone

PHASE 2: QUERYING (runs per user query, real-time)
──────────────────────────────────────────────────
User Question: "What is your return policy?"
        │
        ▼
[EMBED QUERY]    → Convert question to vector
        │
        ▼
[RETRIEVE]       → Find top-3 similar chunks in DB
        │            (cosine similarity search)
        ▼
[AUGMENT PROMPT] → Inject retrieved chunks as context
        │            System: "Answer based on these docs:"
        │            Context: [chunk1] [chunk2] [chunk3]
        │            User: "What is your return policy?"
        ▼
[GENERATE]       → LLM generates grounded answer
        │
        ▼
Answer: "Based on the documentation, returns are accepted
within 30 days with receipt."
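
The RETRIEVE step ranks stored chunks by cosine similarity between the query vector and each chunk vector. A minimal standalone sketch with toy 3-dimensional vectors (real embeddings are e.g. 1536-dimensional, and the vector store computes this for you):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity = dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the query should rank the returns-policy chunk first.
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "returns_policy": [0.8, 0.2, 0.1],
    "shipping_info": [0.1, 0.9, 0.3],
}
ranked = sorted(
    chunk_vecs.items(),
    key=lambda kv: cosine_similarity(query_vec, kv[1]),
    reverse=True,
)
```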

Concept 03

The Indexing Pipeline — Load, Chunk, Embed, Store

import os
import chromadb
from pathlib import Path
from openai import OpenAI

client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./rag_db")

# --- STEP 1: LOAD ---

def load_text_file(file_path: str) -> str:
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def load_documents_from_directory(directory: str) -> list[dict]:
    """Load all .txt and .md files from a directory."""
    docs = []
    for pattern in ("*.txt", "*.md"):
        for file_path in Path(directory).rglob(pattern):
            text = load_text_file(str(file_path))
            docs.append({
                "text": text,
                "source": str(file_path),
                "filename": file_path.name,
            })
    return docs

# --- STEP 2: CHUNK ---

def chunk_document(
    text: str,
    chunk_size: int = 500,        # window size in words (the code splits on whitespace, not tokens)
    chunk_overlap: int = 50,      # overlapping words for context continuity
) -> list[str]:
    """
    Split document into overlapping chunks by word count.
    For production, use sentence-aware chunking to avoid mid-sentence splits.
    """
    words = text.split()
    chunks = []
    step = chunk_size - chunk_overlap

    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)

    return chunks

# --- STEP 3 & 4: EMBED AND STORE ---

def get_embedding_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[t.replace("\n", " ") for t in texts],
    )
    return [item.embedding for item in response.data]

def build_index(documents_dir: str, collection_name: str = "knowledge_base"):
    """
    Complete indexing pipeline: Load → Chunk → Embed → Store.
    """
    # Load documents
    raw_docs = load_documents_from_directory(documents_dir)
    print(f"Loaded {len(raw_docs)} documents")

    # Get or create collection
    try:
        chroma_client.delete_collection(collection_name)
    except Exception:
        pass

    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )

    # Process each document
    total_chunks = 0
    for doc in raw_docs:
        chunks = chunk_document(doc["text"])
        print(f"  {doc['filename']}: {len(chunks)} chunks")

        # Embed and store in batches of 50
        batch_size = 50
        for i in range(0, len(chunks), batch_size):
            batch_chunks = chunks[i:i + batch_size]
            batch_embeddings = get_embedding_batch(batch_chunks)

            # Generate unique IDs for each chunk
            ids = [f"{doc['filename']}_chunk_{i + j}" for j in range(len(batch_chunks))]
            metadatas = [{"source": doc["source"], "filename": doc["filename"]}
                         for _ in batch_chunks]

            collection.add(
                documents=batch_chunks,
                embeddings=batch_embeddings,
                metadatas=metadatas,
                ids=ids,
            )

            total_chunks += len(batch_chunks)

    print(f"\nIndex built: {total_chunks} total chunks stored")
    return collection
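
The sliding-window arithmetic in chunk_document can be checked in isolation. This standalone sketch reproduces the same word-window logic on synthetic words so the overlap behavior is visible:

```python
def chunk_words(words: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Same sliding-window logic as chunk_document, but returning word lists."""
    step = chunk_size - overlap  # window advances by 450 words per chunk
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(1200)]
chunks = chunk_words(words)
# Windows start at words 0, 450, and 900, giving chunks of 500, 500,
# and 300 words; consecutive chunks share 50 words of overlap.
```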

Concept 04

Chunking Strategies — The Choice That Matters Most

Strategy       | How it works                          | Pros                   | Cons                                | Best for
Fixed size     | Split every N tokens/words            | Simple, predictable    | Splits mid-sentence                 | Prototypes, homogeneous text
Sentence-aware | Split at sentence boundaries          | No broken sentences    | Variable chunk sizes                | Articles, documentation
Semantic       | Group semantically related sentences  | Best retrieval quality | Expensive (embeds during chunking)  | High-quality production RAG
Hierarchical   | Parent doc + child chunks             | Retrieves detail but returns parent | Complex indexing       | Long technical documents

import re

def chunk_by_sentences(
    text: str,
    sentences_per_chunk: int = 5,
    overlap_sentences: int = 1,
) -> list[str]:
    """
    Sentence-aware chunking.
    Never splits mid-sentence. Maintains context with overlap.
    """
    # Split into sentences
    sentence_endings = r'(?<=[.!?])\s+'
    sentences = re.split(sentence_endings, text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    step = sentences_per_chunk - overlap_sentences

    for i in range(0, len(sentences), step):
        chunk_sentences = sentences[i:i + sentences_per_chunk]
        if chunk_sentences:
            chunks.append(" ".join(chunk_sentences))

    return chunks
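
A quick check of the overlap behavior, rerunning the same sentence-splitting logic on a toy six-sentence paragraph (5 sentences per chunk, 1 overlapping):

```python
import re

def chunk_by_sentences(text: str, sentences_per_chunk: int = 5,
                       overlap_sentences: int = 1) -> list[str]:
    # Same regex-based splitter and overlap logic as the function above.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    step = sentences_per_chunk - overlap_sentences
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), step)
            if sentences[i:i + sentences_per_chunk]]

text = "One. Two. Three. Four. Five. Six."
chunks = chunk_by_sentences(text)
# Two chunks; the sentence "Five." ends the first chunk and
# starts the second — that shared sentence is the overlap.
```
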
Chunking is the #1 RAG Quality Factor

Poor chunking is the most common reason RAG systems give bad answers. Too small: chunks lack context to answer anything. Too large: irrelevant text dilutes the relevant part. Ideal for most use cases: 300–500 tokens with 10–15% overlap. If your retrieved chunks don't actually contain the answer, your LLM can't generate it.

Concept 05

The Query Pipeline — Embed, Retrieve, Augment, Generate

def retrieve_context(
    query: str,
    collection: chromadb.Collection,
    top_k: int = 5,
    min_similarity: float = 0.7,
) -> list[dict]:
    """
    Retrieve relevant chunks for a query.
    Filters out low-similarity results.
    """
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query.replace("\n", " "),
    ).data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Convert distances to similarities (ChromaDB cosine distance = 1 - cosine_similarity)
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        similarity = 1 - dist
        if similarity >= min_similarity:
            chunks.append({
                "text": doc,
                "source": meta.get("filename", "unknown"),
                "similarity": similarity,
            })

    return chunks

def build_augmented_prompt(query: str, retrieved_chunks: list[dict]) -> str:
    """
    Inject retrieved context into the prompt.
    Clear source attribution reduces hallucination.
    """
    if not retrieved_chunks:
        return query

    context_parts = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['source']}]\n{chunk['text']}"
        )

    context = "\n\n".join(context_parts)

    return f"""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.

CONTEXT:
{context}

QUESTION: {query}"""
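
To see what the model actually receives, here is the augmented prompt assembled from two hypothetical chunks (the filenames and text are invented for illustration):

```python
# Hypothetical retrieved chunks, in the shape retrieve_context returns.
chunks = [
    {"source": "returns.txt", "text": "Returns accepted within 30 days with receipt."},
    {"source": "faq.txt", "text": "Software licenses are non-refundable after activation."},
]

# Same formatting as build_augmented_prompt: numbered, source-attributed blocks.
context = "\n\n".join(
    f"[Source {i}: {c['source']}]\n{c['text']}" for i, c in enumerate(chunks, 1)
)
prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say "I don't have information about that."
Do not make up information.

CONTEXT:
{context}

QUESTION: What is the return policy?"""
```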

Concept 06

The Complete Working RAG Pipeline

class RAGSystem:
    """
    Complete production-ready RAG system.
    Supports indexing, querying, and source citation.
    """

    SYSTEM_PROMPT = """You are a helpful assistant that answers questions based strictly
on provided context documents. Follow these rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so explicitly
3. Always cite which source document your answer comes from
4. Never fabricate information or draw on general knowledge"""

    def __init__(self, collection_name: str = "knowledge_base"):
        self.collection_name = collection_name
        self._collection = None

    @property
    def collection(self):
        if self._collection is None:
            self._collection = chroma_client.get_collection(self.collection_name)
        return self._collection

    def index(self, documents_dir: str):
        """Index all documents in directory."""
        self._collection = build_index(documents_dir, self.collection_name)

    def query(
        self,
        question: str,
        top_k: int = 5,
        min_similarity: float = 0.65,
        model: str = "gpt-4o-mini",
    ) -> dict:
        """
        Full RAG query: retrieve context → augment prompt → generate answer.
        Returns answer, sources, and retrieved chunks for debugging.
        """
        # Step 1: Retrieve relevant context
        retrieved_chunks = retrieve_context(
            question,
            self.collection,
            top_k=top_k,
            min_similarity=min_similarity,
        )

        if not retrieved_chunks:
            return {
                "answer": "I couldn't find relevant information in the knowledge base to answer this question.",
                "sources": [],
                "chunks_retrieved": 0,
            }

        # Step 2: Build augmented prompt
        augmented_query = build_augmented_prompt(question, retrieved_chunks)

        # Step 3: Generate answer
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": augmented_query},
            ],
            temperature=0,  # Deterministic for factual Q&A
        )

        answer = response.choices[0].message.content

        return {
            "answer": answer,
            "sources": list(set(c["source"] for c in retrieved_chunks)),
            "chunks_retrieved": len(retrieved_chunks),
            "top_similarity": max(c["similarity"] for c in retrieved_chunks),
            "retrieved_chunks": retrieved_chunks,  # For debugging
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            }
        }

    def chat(self, conversation_history: list[dict], new_question: str) -> dict:
        """
        RAG with conversation history.
        Uses the full conversation for better query understanding.
        """
        # Build context-aware query from conversation history
        if conversation_history:
            history_text = "\n".join([
                f"{m['role'].upper()}: {m['content']}"
                for m in conversation_history[-4:]  # Last 2 turns
            ])
            contextualized_query = f"""Given this conversation:
{history_text}

Answer this follow-up question: {new_question}"""
        else:
            contextualized_query = new_question

        return self.query(contextualized_query)


# Complete usage example
rag = RAGSystem()

# Index your documents (do once)
rag.index("./company_docs")

# Query
result = rag.query("What is the company's return policy for software products?")
print("Answer:", result["answer"])
print("Sources:", result["sources"])
print(f"Retrieved {result['chunks_retrieved']} chunks")
print(f"Top similarity: {result['top_similarity']:.3f}")

Concept 07

Evaluating RAG — How to Know If It's Actually Working

RAG systems need systematic evaluation. Vibe-checking is not enough. Here are the key metrics and how to measure them:

Metric              | What it measures                                    | Target | How to measure
Context Precision   | Of retrieved chunks, what % were actually relevant? | > 0.8  | LLM-as-judge: "Is this chunk relevant to the question?"
Context Recall      | Did retrieval find all relevant chunks?             | > 0.7  | Requires ground-truth annotations
Answer Faithfulness | Is the answer supported by retrieved context?       | > 0.9  | LLM-as-judge: "Is every claim in the answer supported by context?"
Answer Relevance    | Does the answer actually address the question?      | > 0.8  | LLM-as-judge or human eval

import json

def evaluate_rag_faithfulness(
    question: str,
    answer: str,
    retrieved_context: list[str],
) -> dict:
    """
    LLM-as-judge for answer faithfulness.
    Checks if the answer makes claims not supported by the context.
    """
    context_text = "\n\n".join(retrieved_context)

    eval_prompt = f"""You are evaluating a RAG system's answer for faithfulness.

CONTEXT DOCUMENTS:
{context_text}

QUESTION: {question}
ANSWER: {answer}

Evaluate: Does every factual claim in the ANSWER appear in the CONTEXT DOCUMENTS?

Return JSON:
{{
  "faithful": true/false,
  "score": 0.0-1.0,
  "unsupported_claims": ["claim1", "claim2"] or [],
  "reasoning": "brief explanation"
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    return json.loads(response.choices[0].message.content)
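
Context precision and recall from the table above reduce to simple ratios once you have per-chunk relevance judgments (from an LLM judge or ground-truth annotations). A sketch:

```python
def context_precision(relevance_flags: list[bool]) -> float:
    """Fraction of retrieved chunks judged relevant to the question."""
    return sum(relevance_flags) / len(relevance_flags)

def context_recall(retrieved_ids: list[str], ground_truth_ids: list[str]) -> float:
    """Fraction of ground-truth relevant chunks that retrieval actually found."""
    truth = set(ground_truth_ids)
    return len(truth & set(retrieved_ids)) / len(truth)
```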

Concept 08

8 Common RAG Failures and How to Fix Them

Failure 1: Retrieval returns wrong chunks

Symptom: Correct answer exists in your docs but LLM says "I don't have that information."
Fix: Check similarity scores. If top result scores below 0.7, your chunking is breaking up the relevant passage. Try larger chunks or sentence-aware chunking.

Failure 2: LLM ignores retrieved context

Symptom: Retrieved chunks contain the answer but LLM answers from training data.
Fix: Strengthen your system prompt: "You MUST only use the provided context. Never draw on training knowledge." Also reduce temperature to 0.

Failure 3: Chunks too small — no context in each chunk

Symptom: Retrieved chunks are semantically close but don't contain complete answers.
Fix: Increase chunk size from 200 → 500 tokens. Add overlap. Consider parent-child chunking: embed small chunks but retrieve their parent document.

Failure 4: Embedding model mismatch

Symptom: You re-index but existing queries don't work anymore.
Fix: Always embed queries with the same model used to embed documents. Store the model name in collection metadata and validate on startup.

Failure 5: Stale index

Symptom: Documents have been updated but RAG still returns old information.
Fix: Implement incremental indexing — track document modification timestamps and re-embed changed files. Or rebuild the full index on a schedule.
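
The timestamp-tracking part of incremental indexing can be as simple as diffing a stored path-to-mtime map against the current filesystem state. A sketch (persisting `indexed_mtimes` between runs is left out):

```python
def files_needing_reindex(
    current_mtimes: dict[str, float],
    indexed_mtimes: dict[str, float],
) -> list[str]:
    """Return paths that are new or whose modification time has changed."""
    return [
        path for path, mtime in current_mtimes.items()
        if indexed_mtimes.get(path) != mtime
    ]
```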

Failure 6: Multi-hop questions fail

Symptom: "What is the CEO's email?" fails because CEO name and email are in different chunks.
Fix: Query decomposition — break complex questions into sub-queries, retrieve for each, then synthesize.
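
A toy illustration of the decompose-retrieve-synthesize flow. In a real system an LLM generates the sub-queries and retrieval is vector search; the mini index and questions here are invented:

```python
# Hypothetical single-fact "index": each fact lives in its own chunk.
mini_index = {
    "who is the ceo": "The CEO is Dana Kim.",
    "what is dana kim's email": "Dana Kim's email is dana@example.com.",
}

def retrieve(sub_query: str) -> str:
    """Stand-in for vector retrieval: exact lookup on the normalized query."""
    return mini_index.get(sub_query.lower().rstrip("?"), "")

# "What is the CEO's email?" decomposed into two single-hop sub-queries:
sub_queries = ["Who is the CEO?", "What is Dana Kim's email?"]
combined_context = "\n".join(retrieve(q) for q in sub_queries)
# combined_context now holds both facts, so a final LLM call can answer.
```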

Failure 7: top_k too low or too high

Symptom: Too low (k=1): misses relevant context. Too high (k=20): context window fills with noise.
Fix: Use k=5 as default. Add similarity threshold (0.65+) to filter low-relevance results. Monitor average similarity of retrieved chunks.

Failure 8: No source citation

Symptom: Users can't verify answers; trust degrades over time.
Fix: Always include source attribution in your system prompt. Return source filenames in your API response. Let users click through to the original document.

CP-06 Summary
  • RAG = Retrieval + Augmented + Generation: retrieve relevant docs, inject as context, generate grounded answer
  • Two phases: indexing (offline, one-time) and querying (real-time, per request)
  • Chunking strategy is the #1 quality factor — use 300-500 tokens with 10-15% overlap
  • Always include a minimum similarity threshold to avoid injecting irrelevant context
  • Evaluate with faithfulness (answer supported by context?) and relevance (answer addresses question?)
  • Most RAG failures are retrieval failures, not generation failures — fix the retrieval first