By the end of this checkpoint, you'll have a repeatable 5-step framework you can apply to any AI system design question in an interview. You'll also have complete design blueprints for the two most common questions: enterprise RAG chatbot and code completion system.
Concept 01
The 5-Step Framework — Apply to Any AI System Design Question
Every AI system design interview question, regardless of the product, reduces to the same five decisions. Memorise this framework and apply it in order. Interviewers reward structured thinking over encyclopaedic knowledge.
| Step | Questions to answer | Why it matters |
|---|---|---|
| 1. Requirements | What queries? What data sources? Latency SLA? Cost budget? Scale? | Every design decision flows from requirements. Don't skip this. |
| 2. Retrieval Strategy | RAG, fine-tuning, or pure prompting? What chunking strategy? What vector DB? | Wrong retrieval = wrong answers. This is the core architectural decision. |
| 3. Query Pipeline | How does a user query flow through the system? What's the prompt assembly strategy? | End-to-end latency, quality, and cost depend on query pipeline design. |
| 4. Evaluation | How do you measure quality? What metrics? How do you catch regressions? | Without evaluation, you can't improve. Most candidates skip this — stand out by including it. |
| 5. Scaling & Operations | What breaks at 10x load? How do you update knowledge? Cost at scale? | Shows production thinking, not just architecture drawing. |
"Before I start drawing components, let me clarify requirements. Is this an internal enterprise tool or public-facing? What's the latency expectation — under 2 seconds for the first token? How many documents are in the knowledge base, and how often does it update? What's the monthly cost budget? ..."
Spending 2 minutes on requirements signals seniority. Junior candidates jump straight to boxes-and-arrows diagrams.
Concept 02
RAG vs Fine-Tuning vs Prompting — The Decision Tree
The single most important architectural decision in an AI system is how you give the model knowledge. Three options exist, and each has a distinct use case.
| | RAG | Fine-tuning | Prompting only |
|---|---|---|---|
| Best for | Private/proprietary knowledge, frequently updated data | Specific style, tone, or behaviour change; domain-specific reasoning | General tasks the base model already handles well |
| Update cost | Low — update vector store | High — re-train, days + GPU cost | None |
| Hallucination risk | Low (grounded in retrieved docs) | Medium (knowledge baked in weights, can drift) | High for factual queries |
| Latency overhead | +100–300ms for retrieval | None (faster inference on smaller model) | None |
| Cost | Embedding + vector DB storage + retrieval per query | GPU training (thousands of dollars) + deployment | Just inference cost |
| Example use case | Company policy chatbot, customer support over product docs | Code completion model for a specific language/codebase style | Summarisation, translation, general Q&A |
Default to RAG for knowledge-heavy tasks. Use fine-tuning only when you need behaviour/style change and have a labelled dataset of 1,000+ examples. Use prompting-only when the base model's knowledge is sufficient. Hybrid (RAG + fine-tuned retriever) for the highest-quality production systems.
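The decision tree above can be sketched as a small heuristic. This is illustrative only — the `TaskProfile` fields and thresholds are assumptions mirroring the table, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    needs_private_knowledge: bool   # proprietary or frequently updated data?
    needs_style_change: bool        # tone/format/behaviour the base model lacks?
    labelled_examples: int          # size of any labelled dataset


def choose_approach(task: TaskProfile) -> str:
    """Mirror the decision tree: default to RAG for knowledge, fine-tune for behaviour."""
    if (task.needs_private_knowledge and task.needs_style_change
            and task.labelled_examples >= 1000):
        return "hybrid: RAG + fine-tuning"
    if task.needs_private_knowledge:
        return "RAG"
    if task.needs_style_change and task.labelled_examples >= 1000:
        return "fine-tuning"
    return "prompting only"


print(choose_approach(TaskProfile(True, False, 0)))  # RAG
```

In an interview, walking through a check like this out loud (knowledge first, then behaviour, then dataset size) shows you reason from requirements rather than defaulting to a favourite technique.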
Concept 03
Designing the Retrieval Layer — The Details That Separate Good from Great
The retrieval layer is where most RAG systems fail in practice. Good retrieval design requires answering five sub-questions:
1. Chunking Strategy
| Strategy | How it works | Use when |
|---|---|---|
| Fixed-size | Split every N tokens with M token overlap | Simple documents, consistent structure |
| Sentence-aware | Split on sentence boundaries, batch sentences to hit target size | Prose documents (articles, docs) |
| Recursive / structural | Split on headers, then paragraphs, then sentences | Markdown, HTML, code files with clear structure |
| Semantic | Embed sentences, split where cosine similarity drops | Long documents with clear topic changes |
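A minimal sketch of the first strategy, fixed-size chunking with overlap. To stay dependency-free it approximates "tokens" with whitespace-split words — a real pipeline would use the embedding model's tokenizer, and `size`/`overlap` values here are illustrative.

```python
def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of ~size words (requires size > overlap)."""
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last chunk reached the end; avoid a tiny trailing fragment
    return chunks


doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_fixed(doc, size=300, overlap=50)
print(len(chunks))  # 3 chunks: words 0-299, 250-549, 500-699
```

The overlap matters: without it, a sentence straddling a chunk boundary is split across two chunks and neither embedding captures it well.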
2. Vector Database Choice
| DB | Best for | Managed? |
|---|---|---|
| ChromaDB | Local dev, prototypes, <1M vectors | Self-hosted or in-process |
| Pinecone | Cloud production, millions of vectors, no infra management | Yes (cloud) |
| pgvector | If you already use Postgres — one less system to operate | Via managed Postgres |
| Qdrant | Open-source, high performance, on-prem compliance needs | Self-hosted or cloud |
| Firestore Vector Search | Firebase projects — native integration, free tier available | Yes (Firebase) |
3. Hybrid Search
Pure vector search misses exact matches (product SKUs, names, error codes). Pure keyword search misses synonyms. Hybrid search combines BM25 (keyword) + vector search and re-ranks results with a cross-encoder. Use hybrid search in any production RAG system.
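One common way to merge the two result lists before cross-encoder re-ranking is Reciprocal Rank Fusion (RRF) — a sketch below, with `k=60` as the conventional constant and the document IDs invented for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["sku-123", "doc-a", "doc-b"]     # keyword search: exact-match strengths
vector_hits = ["doc-a", "doc-c", "sku-123"]   # vector search: semantic strengths
print(rrf([bm25_hits, vector_hits])[:3])
```

Documents that appear near the top of both lists (here `doc-a`) win, which is exactly the behaviour you want before handing the top candidates to a cross-encoder.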
Concept 04
Latency Budget for AI Features
Users expect search results in under 200ms and chatbot responses in under 2 seconds for first token. LLM features must fit within these budgets. Break down the latency stack:
| Component | Typical latency | Optimization lever |
|---|---|---|
| Query embedding | 30–80ms | Cache embeddings of common queries |
| Vector search | 10–50ms | HNSW index tuning, ANN vs exact search |
| Context assembly | 5–15ms | Pre-format chunks, avoid re-processing |
| LLM API (time to first token) | 400–1200ms | Streaming, smaller model, response caching |
| Network + serialization | 20–50ms | CDN, keep connections warm |
| Total (no cache) | 465–1395ms | |
| Total (cache hit) | <50ms | Always implement response caching |
Without streaming, users wait for the full response before seeing anything. With streaming, they see the first token within 500ms — perceived as "fast." Always use StreamingResponse for chat interfaces. Buffer the stream server-side if you need to run post-generation filters.
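The server-side buffering idea can be sketched with a plain generator: hold back a small sliding window of tokens so a post-generation filter can drop a blocked token before it reaches the client. The blocked list, window size, and token shapes are illustrative assumptions; a real filter would scan phrases, not single tokens.

```python
from collections import deque
from typing import Iterable, Iterator


def filtered_stream(tokens: Iterable[str], blocked: set[str],
                    window: int = 3) -> Iterator[str]:
    """Yield tokens with a `window`-token delay, dropping any blocked token."""
    buf: deque[str] = deque()
    for tok in tokens:
        buf.append(tok)
        if len(buf) > window:
            out = buf.popleft()
            if out.strip().lower() not in blocked:
                yield out
    while buf:  # flush the remaining buffered tokens at end of stream
        out = buf.popleft()
        if out.strip().lower() not in blocked:
            yield out


stream = ["The ", "answer ", "secret ", "is ", "42"]
print("".join(filtered_stream(stream, blocked={"secret"})))
```

The trade-off is explicit: a larger window catches longer problem spans but adds that many tokens of latency before the first token reaches the user.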
Concept 05
Cost Architecture — Per-Query Economics
At scale, AI costs are not negligible. A system serving 100,000 queries/day can easily cost $3,000–$15,000/month depending on model choice. Design your cost architecture before you deploy.
# Cost model for a RAG chatbot (per query)
#
# Assumptions:
# - Query embedding: 50 tokens input @ text-embedding-3-small ($0.02/1M) = $0.000001
# - Context assembly: 1,500 tokens (5 chunks * 300 tokens each)
# - LLM call: 1,500 input + 500 output @ gpt-4o-mini ($0.15/$0.60 per 1M)
#
# Per-query cost:
embedding_cost = 50 / 1_000_000 * 0.02 # $0.000001
llm_input_cost = 1500 / 1_000_000 * 0.15 # $0.000225
llm_output_cost = 500 / 1_000_000 * 0.60 # $0.000300
total_per_query = embedding_cost + llm_input_cost + llm_output_cost
# = ~$0.000526 per query
# At 100,000 queries/day:
daily_cost = 100_000 * total_per_query # $52.60/day
monthly_cost = daily_cost * 30 # $1,578/month
# With 40% cache hit rate:
effective_monthly = monthly_cost * 0.60 # ~$947/month
# Switch to gemini-1.5-flash ($0.075 input, $0.30 output per 1M)?
# llm_input_cost = 1500/1M * 0.075 = $0.0001125
# llm_output_cost = 500/1M * 0.30 = $0.00015
# total = $0.0002635 per query → $791/month → saves $787/month vs gpt-4o-mini
Always model your cost at target scale before choosing a model. The difference between gpt-4o and gpt-4o-mini is 17x in cost — often with only minor quality difference on constrained tasks.
Concept 06
Security & Safety Layer — What Interviewers Always Ask
Every AI system design answer should include a safety layer. Interviewers specifically probe for this because most candidates omit it.
| Threat | Description | Mitigation |
|---|---|---|
| Prompt injection | User input overrides system prompt | Input validation, separator tokens, output filtering, never trust user input in system role |
| Data exfiltration | User extracts documents they shouldn't see | Row-level security in vector DB, filter retrieved chunks by user permissions before assembly |
| Jailbreaking | User bypasses content restrictions | OpenAI Moderation API, Llama Guard, or custom classifier on both input and output |
| Hallucination as fact | Model states incorrect info confidently | Faithfulness check post-generation (RAGAS), citations with source links, disclaimer for ungrounded answers |
| Denial of Service | Attacker floods expensive LLM endpoints | Rate limiting per user/IP, token budget per request, WAF rules on endpoint |
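The data-exfiltration mitigation deserves a concrete shape: filter retrieved chunks against the user's permissions before context assembly, so restricted text never enters the prompt. A minimal sketch — the `Chunk` fields and department ACL model are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    department: str   # access-control metadata stored alongside the vector


def authorized_chunks(chunks: list[Chunk],
                      user_departments: set[str]) -> list[Chunk]:
    """Keep only chunks the user's departments are allowed to see."""
    return [c for c in chunks if c.department in user_departments]


retrieved = [Chunk("PTO policy...", "hr"), Chunk("M&A memo...", "legal")]
visible = authorized_chunks(retrieved, user_departments={"hr"})
print([c.text for c in visible])  # ['PTO policy...']
```

In production, push this filter into the vector query itself (a metadata filter or row-level security) so restricted chunks are never even fetched — filtering after retrieval is the fallback, not the ideal.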
Concept 07
Design: Enterprise RAG Chatbot — Full Walkthrough
Question: "Design a chatbot that lets employees ask questions about internal company policies and documentation."
Step 1 — Requirements: Internal use, ~5,000 employees, ~10,000 policy documents (PDF/Word/HTML), documents updated monthly, latency <3s for first token, answer should cite source document, access control (employee can only see docs for their department).
Step 2 — Retrieval Strategy: RAG (knowledge is private, updated monthly — fine-tuning would require monthly re-training). Chunking: recursive/structural (documents have sections and headers). Vector DB: pgvector (company already uses Postgres, minimise new infrastructure). Hybrid search: BM25 + vector, re-rank top 10 to top 5.
Step 3 — Query Pipeline: embed query → hybrid search filtered by the user's department → re-rank top 10 to top 5 → assemble context (chunks plus source metadata for citations) → LLM call (gpt-4o-mini, streaming) → output safety check → answer with source citations.
Step 4 — Evaluation: Golden test set of 50 questions with known answers. Metrics: faithfulness (answer grounded in retrieved docs), answer relevance, citation accuracy. Regression gate in CI — any model or prompt change must maintain baseline scores.
Step 5 — Scaling: Vector search stays fast even at ~10M chunks (pgvector handles this with an HNSW index). LLM cost: ~$0.0005/query × 500 queries/day ≈ $0.25/day, or roughly $7.50/month — well within budget. Monthly doc update: incremental re-index of only the changed documents (track a content hash per doc).
Concept 08
Design: Code Completion System — Key Decisions
Question: "Design a GitHub Copilot-like code completion system."
Why this is different from RAG: Code completion has extreme latency requirements (<100ms for inline suggestions), a very specific context assembly problem (surrounding code, not documents), and benefits from fine-tuning on the company's codebase style — not just retrieval.
Key architectural decisions:
- Model choice: Use a code-specific model fine-tuned on your codebase (CodeLlama, DeepSeek Coder) rather than a general LLM — 10x cheaper inference, better code quality for your style.
- Context assembly (Fill-in-the-Middle): Include prefix (code above cursor), suffix (code below cursor), open file imports, and recently edited files from the same session. Not documents.
- Latency: Stream completions. Prefetch suggestions speculatively as the user types. Cache completions for common patterns. Target 50ms time-to-first-token with a local model deployment.
- Speculative decoding: Use a small draft model (1B params) to generate tokens fast, verify with a larger model. Reduces latency 2–3x for common code patterns.
- Feedback loop: Track accept/reject rate per suggestion. Low accept rate → retrain or adjust context window. This is your eval metric.
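The Fill-in-the-Middle context assembly above can be sketched as a prompt builder. The sentinel tokens shown (`<|fim_prefix|>` etc.) follow the common FIM convention, but the exact tokens vary by model — treat them, and the function name, as placeholders rather than any specific model's vocabulary.

```python
def build_fim_prompt(prefix: str, suffix: str,
                     extra_context: list[str]) -> str:
    """Assemble a FIM prompt: shared context, then prefix/suffix around the cursor."""
    # Imports and recently edited files go in front as plain context.
    context = "\n".join(extra_context)
    return (
        f"{context}\n"
        f"<|fim_prefix|>{prefix}"
        f"<|fim_suffix|>{suffix}"
        f"<|fim_middle|>"  # the model generates the completion after this token
    )


prompt = build_fim_prompt(
    prefix="def total(items):\n    return ",
    suffix="\n\nprint(total([1, 2]))",
    extra_context=["import math", "# from recently edited: utils.py"],
)
print(prompt.endswith("<|fim_middle|>"))  # True
```

The key difference from RAG context assembly: the suffix lets the model see code *below* the cursor, so the completion has to splice cleanly into both sides — something a prefix-only prompt cannot enforce.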
When asked to design any AI system, always include: (1) a requirements-first approach, (2) explicit RAG vs fine-tuning reasoning, (3) a latency budget breakdown, (4) per-query cost estimate, (5) a safety layer, and (6) how you'd evaluate quality. Most candidates give you 1–2 of these. Giving all 6 signals a senior engineer who has shipped AI in production.
Practice
Interview Questions for This Checkpoint
- Q: "Design an enterprise RAG chatbot." Walk through your approach.
A: Open with requirements (data sources, latency SLA, access control). Choose RAG (not fine-tuning) — private knowledge, monthly updates. Design indexing pipeline: parse → chunk (recursive) → embed → store (pgvector with dept metadata). Design query pipeline: embed query → hybrid search with dept filter → rerank → assemble context → LLM (gpt-4o-mini, streaming) → output safety check. Include evaluation (faithfulness, citation accuracy) and scaling (cost model, incremental indexing). - Q: How do you decide between RAG and fine-tuning?
A: RAG for factual/knowledge tasks with proprietary or frequently-updated data — easy to update (change the vector store), no training cost, grounded in sources. Fine-tuning for style/behaviour changes, domain-specific reasoning, or when you have 1,000+ labelled examples and need model behaviour change. They're not mutually exclusive — a fine-tuned retriever + RAG generation is often the best production system. - Q: How do you handle latency in an LLM-based system?
A: Build a latency budget: embedding (50ms) + retrieval (30ms) + LLM time-to-first-token (500ms) + network (30ms) = ~610ms. Use streaming to show first token early. Cache deterministic responses (Redis, 24hr TTL). Use cheaper/faster models on non-critical paths. Pre-warm connections. Track P95, not average — tail latency is what users experience. - Q: How do you prevent prompt injection at scale?
A: Four layers: (1) Input validation — reject inputs containing instruction-like patterns. (2) Structural separation — use roles correctly, never concatenate user input into the system message. (3) Output filtering — check response against expected format/topics. (4) Audit logging — log every call with request_id; flag responses that match injection patterns for human review. - Q: What metrics do you use to evaluate an AI system?
A: For RAG: RAGAS — faithfulness, answer relevance, context precision, context recall. For generation quality: human eval (sampled) + automated eval using a judge LLM. For production: latency (P95), cost per query, cache hit rate, error rate, user satisfaction (thumbs up/down). Always have a golden test set with known answers for regression testing.
- Using the 5-step framework, write out a complete design for a customer support chatbot for an e-commerce company. Cover all 5 steps. Time yourself — aim to produce a complete design in 20 minutes (interview pace).
- Build the cost model for your design: how much would it cost per month at 10,000 queries/day? At what point would you switch from gpt-4o-mini to gemini-1.5-flash?
- Draw the architecture diagram for your design as a whiteboard diagram (pen and paper or Excalidraw). Practice explaining each component aloud — "This is the indexing pipeline, which runs nightly to..."
- Apply the 5-step framework: Requirements → Retrieval Strategy → Query Pipeline → Evaluation → Scaling
- Default to RAG for knowledge-heavy tasks; fine-tune only for behaviour/style with labelled data
- Hybrid search (BM25 + vectors + re-ranker) consistently outperforms pure vector search in production
- Build a latency budget before choosing components — streaming buys most of your perceived speed
- Model cost at target scale before committing to a model — the difference is often 10–17x
- Always include a safety layer in your design: prompt injection, access control, output filtering
- Interviewers reward: requirements-first thinking, RAG vs fine-tune reasoning, cost modelling, evaluation strategy