The most important topic for SDEs building AI products in 2025. Know every part of this pipeline; it comes up in almost every AI engineer interview.
How you split documents before embedding is one of the most impactful decisions in your RAG pipeline. Poor chunking → irrelevant retrieval → bad answers.
| Strategy | How It Works | Best For | Tradeoff |
|---|---|---|---|
| Fixed-size | Split every N tokens/chars, with M overlap | Simple baseline, good starting point | Can split mid-sentence, mid-concept |
| Recursive character | Try paragraph → sentence → word boundaries in order | General purpose — LangChain's default. Works well for most text. | Variable chunk sizes, slightly more complex |
| Sentence-based | Split on sentence boundaries (periods, newlines) | Cleaner semantics, good for structured text | Variable sizes, very short sentences lose context |
| Semantic | Embed sentences, split where cosine similarity drops (topic change) | Highest quality, topic-coherent chunks | Expensive (calls embedding model per sentence), slowest |
| Hierarchical (parent-child) | Small chunks (child) for precise retrieval; return large parent chunk to LLM | Best precision + context. Legal docs, research papers. | More complex to implement, need to store parent-child links |
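The recursive character strategy above can be hand-rolled in a few lines. This is a simplified sketch, assuming character counts rather than tokens (in practice you'd reach for something like LangChain's `RecursiveCharacterTextSplitter`, which also handles tokenizers and separator retention):

```python
# Minimal recursive character splitting (pure Python, no libraries).
# Try coarse separators first (paragraph, sentence, word) and fall back to
# finer ones only when a piece is still over the size limit.

def recursive_split(text: str, max_chars: int = 200,
                    separators=("\n\n", ". ", " ")) -> list[str]:
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left: hard-cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate          # keep packing pieces into one chunk
        else:
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_chars:
                current = piece
            else:
                # Piece itself is too long: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks
```

Note what this buys you over fixed-size splitting: a paragraph boundary is never crossed unless a single paragraph alone exceeds the limit.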
Problem: A bi-encoder (embedding search) retrieves ~20 candidates fast, but because it encodes query and chunk independently, its scores aren't precise enough for final ranking.
Solution: Re-rank with a cross-encoder, which reads the query and chunk together and produces a much more accurate relevance score. Pass only the top few re-ranked chunks to the LLM.
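A sketch of the two-stage pattern. `score_pair` here is a toy token-overlap stand-in, not a real model; in practice you'd call an actual cross-encoder (e.g. sentence-transformers' `CrossEncoder.predict` on `(query, chunk)` pairs):

```python
# Stage 2 of retrieval: re-score each (query, chunk) pair jointly, then keep
# only the best few. score_pair is a placeholder for a cross-encoder forward
# pass; it scores by token overlap purely so this sketch runs standalone.

def score_pair(query: str, chunk: str) -> float:
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # candidates would be the ~20 chunks the bi-encoder returned.
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_k]

docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Our office hours are 9am to 5pm on weekdays.",
    "Shipping takes 3 to 5 business days.",
]
top = rerank("what is the refund policy", docs, top_k=1)
```

The design point is the interface, not the scoring function: fast-but-rough retrieval produces candidates, slow-but-accurate scoring orders them.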
Problem: Short user queries and long document chunks are in different semantic spaces — mismatch hurts retrieval.
Solution: HyDE (Hypothetical Document Embeddings): have the LLM generate a hypothetical answer to the query, embed that answer, and search with it, since document-to-document similarity matches better than query-to-document. Alternatively, use an asymmetric embedding model with separate query and document encoders.
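A minimal sketch of HyDE (search with the embedding of a hypothetical answer rather than the short query). Both `generate_hypothetical` and `embed` are stubs, not real APIs: the first would be an LLM call, the second your embedding model.

```python
# HyDE sketch: the search vector comes from a generated *answer*, so it lives
# in the same semantic space as the document chunks it must match.

def generate_hypothetical(query: str) -> str:
    # Stub. A real implementation prompts an LLM with something like:
    # "Write a short passage that answers: {query}"
    return f"A passage answering the question: {query}."

def embed(text: str) -> list[float]:
    # Stub embedding: normalized character-frequency vector, standing in
    # for a real embedding model so the sketch runs without dependencies.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def hyde_search_vector(query: str) -> list[float]:
    # This vector, not embed(query), goes into the vector-DB lookup.
    return embed(generate_hypothetical(query))
```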
Metadata filtering: filter by date, category, author, or user permissions before vector search. Critical for enterprise RAG where access control matters. "Only search documents from Q4 2024" → filter by date metadata.
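In code, the pre-filter is just a predicate applied before the (more expensive) vector comparison. The schema below (`date`, `category`, `allowed_roles`) is illustrative, not any particular vector DB's API; real stores expose this as a filter expression on the query:

```python
# Metadata pre-filtering sketch: narrow the candidate set by stored metadata,
# then run vector search only over what survives.

from datetime import date

INDEX = [
    {"text": "Q4 revenue summary", "date": date(2024, 11, 2),
     "category": "finance", "allowed_roles": {"finance", "exec"}},
    {"text": "Holiday party plan", "date": date(2024, 12, 10),
     "category": "hr", "allowed_roles": {"hr"}},
    {"text": "Q1 revenue summary", "date": date(2024, 2, 1),
     "category": "finance", "allowed_roles": {"finance", "exec"}},
]

def prefilter(index, *, role, category=None, after=None):
    # Access control first: a user must never retrieve chunks they can't see.
    hits = [d for d in index if role in d["allowed_roles"]]
    if category:
        hits = [d for d in hits if d["category"] == category]
    if after:
        hits = [d for d in hits if d["date"] >= after]
    return hits  # vector search then runs only over these documents

# "Only search documents from Q4 2024", as an exec asking about finance:
q4_docs = prefilter(INDEX, role="exec", category="finance",
                    after=date(2024, 10, 1))
```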
| Metric | What It Measures | Bad Score Means | Target |
|---|---|---|---|
| Faithfulness | Is every claim in the answer supported by the retrieved context? Is the LLM making things up beyond what was retrieved? | LLM is hallucinating — going beyond what was in the retrieved chunks | > 0.85 |
| Answer Relevance | Does the answer actually address the question asked? | Answer is grounded but off-topic or incomplete | > 0.80 |
| Context Precision | Of the retrieved chunks, what fraction are actually relevant to the question? | Noisy retrieval — irrelevant chunks confuse the LLM | > 0.75 |
| Context Recall | Do the retrieved chunks cover all aspects needed to fully answer the question? | Missing information — relevant chunks weren't retrieved | > 0.75 |
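Context Precision and Context Recall are easy to make concrete when ground-truth relevance labels exist; evaluation frameworks like RAGAS instead estimate them with LLM judgments, but the underlying formulas are the same. A toy version:

```python
# Toy retrieval metrics against ground-truth relevance labels.
# Precision looks at the retrieved set; recall looks at the relevant set.

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Of the retrieved chunks, what fraction are actually relevant?
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Of the relevant chunks, what fraction did retrieval find?
    if not relevant:
        return 1.0
    return sum(c in set(retrieved) for c in relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c4", "c7"}          # what actually answers the question
# precision = 2/4 = 0.5 and recall = 2/3, both below the > 0.75 targets:
# this retriever returns noise AND misses a needed chunk (c7).
```

Faithfulness and Answer Relevance have no such closed form; they require a judge (usually an LLM) to score claims in the generated answer.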
| Failure Mode | Symptom | Fix |
|---|---|---|
| Bad chunking | Answers miss part of the information despite it being in the document | Use recursive character splitting. Increase overlap. Try hierarchical chunking. |
| Wrong embedding model | Retrieves topically unrelated chunks. Low Context Precision. | Upgrade to a better embedding model. MUST use same model for indexing and querying. |
| Context stuffing | Relevant chunk is retrieved but buried in noise; LLM ignores it | Re-rank before passing to LLM. Reduce from top-20 to top-5. Put most relevant chunk first. |
| No metadata | Can't filter by recency or access level. Old documents contaminate results. | Always store date, author, category, user_permissions as metadata at indexing time. |
| Query-document mismatch | Short queries don't match long document embeddings. Low Context Recall. | Use HyDE (generate hypothetical answer, embed that). Or use asymmetric models (separate query/doc encoders). |
| Missing conversation context | Follow-up questions retrieve irrelevant chunks | Rewrite follow-up questions to be standalone before embedding. Include last 3 conversation turns. |
| Stale index | Answers based on outdated documents | Set up incremental indexing: detect document changes (webhook, polling, or version tracking), re-embed only changed chunks. |
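The stale-index fix in the last row can be sketched with content hashes: store a hash per chunk at indexing time, and on each sync re-embed only chunks whose hash changed. The `doc#chunk` id scheme and the in-memory hash store are assumptions for the sketch; in practice the hashes would live alongside your vector index.

```python
# Incremental indexing sketch: detect changed or new chunks by content hash
# so only those get re-embedded, instead of re-indexing everything.

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(chunks: dict[str, str],
                      stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of chunks whose content is new or changed since last sync."""
    return [
        cid for cid, text in chunks.items()
        if stored_hashes.get(cid) != content_hash(text)
    ]

# Hashes recorded at the previous indexing run:
stored = {"doc1#0": content_hash("old intro"),
          "doc1#1": content_hash("pricing")}
# Current chunk contents (from a webhook, polling, or version tracking):
current = {"doc1#0": "new intro", "doc1#1": "pricing", "doc2#0": "brand new"}

stale = chunks_to_reembed(current, stored)
# stale == ["doc1#0", "doc2#0"]: one changed chunk, one new chunk;
# "doc1#1" is unchanged and keeps its existing embedding.
```

Deletions need the reverse check (ids in `stored` but not in `current`), which the same two dicts support.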