What You'll Build

By the end of this checkpoint, you'll have a repeatable 5-step framework you can apply to any AI system design question in an interview. You'll also have complete design blueprints for the two most common questions: enterprise RAG chatbot and code completion system.

Concept 01

The 5-Step Framework — Apply to Any AI System Design Question

Every AI system design interview question, regardless of the product, reduces to the same five decisions. Memorise this framework and apply it in order. Interviewers reward structured thinking over encyclopaedic knowledge.

| Step | Questions to answer | Why it matters |
|---|---|---|
| 1. Requirements | What queries? What data sources? Latency SLA? Cost budget? Scale? | Every design decision flows from requirements. Don't skip this. |
| 2. Retrieval Strategy | RAG, fine-tuning, or pure prompting? What chunking strategy? What vector DB? | Wrong retrieval = wrong answers. This is the core architectural decision. |
| 3. Query Pipeline | How does a user query flow through the system? What's the prompt assembly strategy? | End-to-end latency, quality, and cost depend on query pipeline design. |
| 4. Evaluation | How do you measure quality? What metrics? How do you catch regressions? | Without evaluation, you can't improve. Most candidates skip this — stand out by including it. |
| 5. Scaling & Operations | What breaks at 10x load? How do you update knowledge? Cost at scale? | Shows production thinking, not just architecture drawing. |
How to Open a System Design Answer

"Before I start drawing components, let me clarify requirements. Is this an internal enterprise tool or public-facing? What's the latency expectation — under 2 seconds for the first token? How many documents are in the knowledge base, and how often does it update? What's the monthly cost budget? ..."

Spending 2 minutes on requirements signals seniority. Junior candidates jump straight to boxes-and-arrows diagrams.

Concept 02

RAG vs Fine-Tuning vs Prompting — The Decision Tree

The single most important architectural decision in an AI system is how you give the model knowledge. Three options exist, and each has a distinct use case.

| | RAG | Fine-tuning | Prompting only |
|---|---|---|---|
| Best for | Private/proprietary knowledge, frequently updated data | Specific style, tone, or behaviour change; domain-specific reasoning | General tasks the base model already handles well |
| Update cost | Low — update vector store | High — re-train, days + GPU cost | None |
| Hallucination risk | Low (grounded in retrieved docs) | Medium (knowledge baked into weights, can drift) | High for factual queries |
| Latency overhead | +100–300ms for retrieval | None (faster inference on smaller model) | None |
| Cost | Embedding + vector DB storage + retrieval per query | GPU training (thousands of dollars) + deployment | Just inference cost |
| Example use case | Company policy chatbot, customer support over product docs | Code completion model for a specific language/codebase style | Summarisation, translation, general Q&A |
Decision Rule (Use This in Interviews)

Default to RAG for knowledge-heavy tasks. Use fine-tuning only when you need behaviour/style change and have a labelled dataset of 1,000+ examples. Use prompting-only when the base model's knowledge is sufficient. Hybrid (RAG + fine-tuned retriever) for the highest-quality production systems.
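The decision rule above can be encoded as a small helper, which is also a compact way to state it in an interview. This is a minimal sketch: the function name, parameters, and the 1,000-example threshold restate the rule as written, not a standard API.

```python
def choose_knowledge_strategy(
    needs_private_knowledge: bool,
    knowledge_updates_frequently: bool,
    needs_behaviour_change: bool,
    labelled_examples: int,
) -> str:
    """Encode the decision rule: default to RAG for knowledge-heavy tasks."""
    if needs_private_knowledge or knowledge_updates_frequently:
        # Knowledge-heavy task: RAG, or hybrid if you also have the
        # labelled data to fine-tune for behaviour/style.
        if needs_behaviour_change and labelled_examples >= 1000:
            return "hybrid (RAG + fine-tuning)"
        return "rag"
    if needs_behaviour_change and labelled_examples >= 1000:
        return "fine-tuning"
    return "prompting"
```

For example, a company policy chatbot (private knowledge, no behaviour change needed) resolves to `"rag"`, while a codebase-style completion model with a large labelled corpus resolves to `"fine-tuning"`.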

Concept 03

Designing the Retrieval Layer — The Details That Separate Good from Great

The retrieval layer is where most RAG systems fail in practice. Good retrieval design requires answering three sub-questions:

1. Chunking Strategy

| Strategy | How it works | Use when |
|---|---|---|
| Fixed-size | Split every N tokens with M token overlap | Simple documents, consistent structure |
| Sentence-aware | Split on sentence boundaries, batch sentences to hit target size | Prose documents (articles, docs) |
| Recursive / structural | Split on headers, then paragraphs, then sentences | Markdown, HTML, code files with clear structure |
| Semantic | Embed sentences, split where cosine similarity drops | Long documents with clear topic changes |
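The simplest of these, fixed-size chunking with overlap, can be sketched in a few lines. This is an illustration, not a library API: "tokens" are approximated by whitespace-split words, and the 300/50 defaults are typical illustrative values.

```python
def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of ~chunk_size tokens, with `overlap` tokens
    shared between consecutive chunks so content cut at a boundary
    appears in both neighbours."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the document
    return chunks
```

The overlap matters: without it, a sentence split across a chunk boundary is retrievable from neither chunk.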

2. Vector Database Choice

| DB | Best for | Managed? |
|---|---|---|
| ChromaDB | Local dev, prototypes, <1M vectors | Self-hosted or in-process |
| Pinecone | Cloud production, millions of vectors, no infra management | Yes (cloud) |
| pgvector | If you already use Postgres — one less system to operate | Via managed Postgres |
| Qdrant | Open-source, high performance, on-prem compliance needs | Self-hosted or cloud |
| Firestore Vector Search | Firebase projects — native integration, free tier available | Yes (Firebase) |

3. Hybrid Search

Pure vector search misses exact matches (product SKUs, names, error codes). Pure keyword search misses synonyms. Hybrid search combines BM25 (keyword) + vector search and re-ranks results with a cross-encoder. Use hybrid search in any production RAG system.
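One common way to merge the BM25 and vector result lists before the cross-encoder re-rank is reciprocal rank fusion (RRF). A minimal sketch, using the conventional smoothing constant k=60:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by BOTH retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive in production because it needs no score normalisation: BM25 scores and cosine similarities live on different scales, but ranks are always comparable.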

Concept 04

Latency Budget for AI Features

Users expect search results in under 200ms and chatbot responses in under 2 seconds for first token. LLM features must fit within these budgets. Break down the latency stack:

| Component | Typical latency | Optimization lever |
|---|---|---|
| Query embedding | 30–80ms | Cache embeddings of common queries |
| Vector search | 10–50ms | HNSW index tuning, ANN vs exact search |
| Context assembly | 5–15ms | Pre-format chunks, avoid re-processing |
| LLM API (time to first token) | 400–1200ms | Streaming, smaller model, response caching |
| Network + serialization | 20–50ms | CDN, keep connections warm |
| Total (no cache) | 465–1395ms | |
| Total (cache hit) | <50ms | Always implement response caching |
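The table can be kept as an explicit budget in code, so a new component (say, a re-ranker) forces you to re-check the total against the SLA. A sketch using the bounds from the table; the dict keys and the 2,000ms SLA are restatements of this section, not a standard config format.

```python
# (min_ms, max_ms) per component, taken from the latency table above
LATENCY_BUDGET_MS = {
    "query_embedding": (30, 80),
    "vector_search": (10, 50),
    "context_assembly": (5, 15),
    "llm_first_token": (400, 1200),
    "network": (20, 50),
}

def total_latency_ms(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latency across all components."""
    best = sum(lo for lo, _ in budget.values())
    worst = sum(hi for _, hi in budget.values())
    return best, worst

best, worst = total_latency_ms(LATENCY_BUDGET_MS)
assert worst <= 2000  # worst case still fits a 2s first-token SLA
```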
Streaming is Non-Negotiable for Chatbots

Without streaming, users wait for the full response before seeing anything. With streaming, they see the first token within 500ms — perceived as "fast." Always use StreamingResponse for chat interfaces. Buffer the stream server-side if you need to run post-generation filters.
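The "buffer the stream server-side" point can be sketched as a generator that yields tokens to the client immediately while accumulating the full text for a safety check. This is framework-agnostic illustration code; the substring-based `blocked_terms` check stands in for a real output classifier, and in a web app the generator would be wrapped in the framework's streaming response type.

```python
from typing import Iterable, Iterator

def stream_with_safety_buffer(
    token_stream: Iterable[str],
    blocked_terms: set[str],
) -> Iterator[str]:
    """Yield tokens as they arrive while buffering the full response;
    cut the stream short if a blocked term appears in the text so far."""
    buffer: list[str] = []
    for token in token_stream:
        buffer.append(token)
        text_so_far = "".join(buffer).lower()
        if any(term in text_so_far for term in blocked_terms):
            yield "\n[response withheld by safety filter]"
            return  # stop pulling from the model; the buffer is discarded
        yield token
```

Note the trade-off: a purely post-generation filter would be safer (nothing leaves the server until checked) but forfeits the perceived-speed benefit of streaming; this incremental check is the usual middle ground.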

Concept 05

Cost Architecture — Per-Query Economics

At scale, AI costs are not negligible. A system serving 100,000 queries/day can easily cost $3,000–$15,000/month depending on model choice. Design your cost architecture before you deploy.

# Cost model for a RAG chatbot (per query)
#
# Assumptions:
#   - Query embedding: 50 tokens input @ text-embedding-3-small ($0.02/1M) = $0.000001
#   - Context assembly: 1,500 tokens (5 chunks * 300 tokens each)
#   - LLM call: 1,500 input + 500 output @ gpt-4o-mini ($0.15/$0.60 per 1M)
#
# Per-query cost:
embedding_cost = 50 / 1_000_000 * 0.02          # $0.000001
llm_input_cost = 1500 / 1_000_000 * 0.15        # $0.000225
llm_output_cost = 500 / 1_000_000 * 0.60        # $0.000300
total_per_query = embedding_cost + llm_input_cost + llm_output_cost
# = ~$0.000526 per query

# At 100,000 queries/day:
daily_cost = 100_000 * total_per_query           # $52.60/day
monthly_cost = daily_cost * 30                   # $1,578/month

# With 40% cache hit rate:
effective_monthly = monthly_cost * 0.60          # $946/month

# Switch to gemini-1.5-flash ($0.075 input, $0.30 output per 1M)?
# llm_input_cost = 1500/1M * 0.075 = $0.0001125
# llm_output_cost = 500/1M * 0.30 = $0.00015
# total = $0.0002635 per query → $791/month → saves $787/month vs gpt-4o-mini

Always model your cost at target scale before choosing a model. The difference between gpt-4o and gpt-4o-mini is 17x in cost — often with only minor quality difference on constrained tasks.

Concept 06

Security & Safety Layer — What Interviewers Always Ask

Every AI system design answer should include a safety layer. Interviewers specifically probe for this because most candidates omit it.

| Threat | Description | Mitigation |
|---|---|---|
| Prompt injection | User input overrides system prompt | Input validation, separator tokens, output filtering, never trust user input in system role |
| Data exfiltration | User extracts documents they shouldn't see | Row-level security in vector DB, filter retrieved chunks by user permissions before assembly |
| Jailbreaking | User bypasses content restrictions | OpenAI Moderation API, Llama Guard, or custom classifier on both input and output |
| Hallucination as fact | Model states incorrect info confidently | Faithfulness check post-generation (RAGAS), citations with source links, disclaimer for ungrounded answers |
| Denial of Service | Attacker floods expensive LLM endpoints | Rate limiting per user/IP, token budget per request, WAF rules on endpoint |
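A first-pass input validation layer for prompt injection might look like the sketch below. The patterns are illustrative, not an exhaustive defence: pattern matching alone is easily bypassed, which is why the table pairs it with structural role separation, output filtering, and a trained classifier.

```python
import re

# Illustrative instruction-override phrasings; a production system would
# combine this screen with a classifier and structural role separation.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior|above) instructions",
    r"you are now",
    r"reveal .*(system prompt|instructions)",
    r"disregard .*(rules|guidelines)",
]

def looks_like_injection(user_input: str) -> bool:
    """Return True if the input matches a known injection phrasing.
    Matches should be logged and flagged, not silently dropped."""
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)
```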

Concept 07

Design: Enterprise RAG Chatbot — Full Walkthrough

Question: "Design a chatbot that lets employees ask questions about internal company policies and documentation."

Step 1 — Requirements: Internal use, ~5,000 employees, ~10,000 policy documents (PDF/Word/HTML), documents updated monthly, latency <3s for first token, answer should cite source document, access control (employee can only see docs for their department).

Step 2 — Retrieval Strategy: RAG (knowledge is private, updated monthly — fine-tuning would require monthly re-training). Chunking: recursive/structural (documents have sections and headers). Vector DB: pgvector (company already uses Postgres, minimise new infrastructure). Hybrid search: BM25 + vector, re-rank top 10 to top 5.
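The department-filtered vector search from Step 2 might be expressed as a single pgvector query. A sketch under assumptions: a hypothetical `chunks` table with `content`, `source_doc`, `dept`, and `embedding` columns; `<=>` is pgvector's cosine distance operator, and the `%(...)s` placeholders follow the psycopg parameter style.

```python
def dept_filtered_search_sql(top_k: int = 10) -> str:
    """Build a parameterised pgvector query that applies the user's
    department filter INSIDE the search, so chunks the user may not
    see never reach the reranker or the prompt."""
    return (
        "SELECT content, source_doc, "
        "1 - (embedding <=> %(query_embedding)s) AS score "
        "FROM chunks "
        "WHERE dept = %(user_dept)s "
        "ORDER BY embedding <=> %(query_embedding)s "
        f"LIMIT {top_k}"
    )
```

Filtering in the query rather than post-hoc is the row-level-security mitigation from the safety table: access control failures become impossible by construction, not just unlikely.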

Step 3 — Query Pipeline:

ARCHITECTURE DIAGRAM — Enterprise RAG Chatbot

[User] → [API Gateway + Rate Limiter]
              ↓
[Auth Service] → fetch user dept permissions
              ↓
[Query Pipeline Service]
        ↙           ↘
[Query Embedder]   [Query Rewriter] ← optional: expand query
              ↓
[pgvector search with dept filter]
              ↓
[Hybrid Reranker (BM25 + cross-encoder)]
              ↓
[Context Assembler] → inject top 5 chunks + citations
              ↓
[LLM (gpt-4o-mini)] → streaming response
              ↓
[Output Safety Filter]
              ↓
[Response + cited sources] → [User]

[Indexing Pipeline — async, runs nightly]
[Doc Storage] → [Parser] → [Chunker] → [Embedder] → [pgvector]

Step 4 — Evaluation: Golden test set of 50 questions with known answers. Metrics: faithfulness (answer grounded in retrieved docs), answer relevance, citation accuracy. Regression gate in CI — any model or prompt change must maintain baseline scores.

Step 5 — Scaling: Vector search is fast at 10M chunks (pgvector handles this). LLM cost: ~$0.0005/query × 500 queries/day ≈ $0.25/day, roughly $7.50/month — comfortably within budget. Monthly doc update: incremental re-index only changed documents (track doc hash).
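The "track doc hash" point can be sketched as a content-hash diff: only documents whose hash changed since the last run are re-chunked and re-embedded. The function and its parameters are illustrative; in practice `previous_hashes` would be persisted alongside the vector store.

```python
import hashlib

def changed_documents(
    docs: dict[str, str],             # doc_id -> current content
    previous_hashes: dict[str, str],  # doc_id -> hash from the last index run
) -> list[str]:
    """Return IDs of documents that are new or whose content changed,
    so the nightly pipeline only re-embeds those."""
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if previous_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed
```

With monthly updates touching a small fraction of 10,000 documents, this turns a full re-embedding run into a handful of embedding calls.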

Concept 08

Design: Code Completion System — Key Decisions

Question: "Design a GitHub Copilot-like code completion system."

Why this is different from RAG: Code completion has extreme latency requirements (<100ms for inline suggestions), a very specific context assembly problem (surrounding code, not documents), and benefits from fine-tuning on the company's codebase style — not just retrieval.

Key architectural decisions:

  • Model choice: Use a code-specific model fine-tuned on your codebase (CodeLlama, DeepSeek Coder) rather than a general LLM — 10x cheaper inference, better code quality for your style.
  • Context assembly (Fill-in-the-Middle): Include prefix (code above cursor), suffix (code below cursor), open file imports, and recently edited files from the same session. Not documents.
  • Latency: Stream completions. Prefetch suggestions speculatively as the user types. Cache completions for common patterns. Target 50ms time-to-first-token with a local model deployment.
  • Speculative decoding: Use a small draft model (1B params) to generate tokens fast, verify with a larger model. Reduces latency 2–3x for common code patterns.
  • Feedback loop: Track accept/reject rate per suggestion. Low accept rate → retrain or adjust context window. This is your eval metric.
The One Answer That Impresses Interviewers

When asked to design any AI system, always include: (1) a requirements-first approach, (2) explicit RAG vs fine-tuning reasoning, (3) a latency budget breakdown, (4) a per-query cost estimate, (5) a safety layer, and (6) how you'd evaluate quality. Most candidates cover only 1–2 of these. Covering all 6 signals a senior engineer who has shipped AI in production.

Practice

Interview Questions for This Checkpoint

Expected Interview Questions
  1. Q: "Design an enterprise RAG chatbot." Walk through your approach.
    A: Open with requirements (data sources, latency SLA, access control). Choose RAG (not fine-tuning) — private knowledge, monthly updates. Design indexing pipeline: parse → chunk (recursive) → embed → store (pgvector with dept metadata). Design query pipeline: embed query → hybrid search with dept filter → rerank → assemble context → LLM (gpt-4o-mini, streaming) → output safety check. Include evaluation (faithfulness, citation accuracy) and scaling (cost model, incremental indexing).
  2. Q: How do you decide between RAG and fine-tuning?
    A: RAG for factual/knowledge tasks with proprietary or frequently-updated data — easy to update (change the vector store), no training cost, grounded in sources. Fine-tuning for style/behaviour changes, domain-specific reasoning, or when you have 1,000+ labelled examples and need model behaviour change. They're not mutually exclusive — a fine-tuned retriever + RAG generation is often the best production system.
  3. Q: How do you handle latency in an LLM-based system?
    A: Build a latency budget: embedding (50ms) + retrieval (30ms) + LLM time-to-first-token (500ms) + network (30ms) = ~610ms. Use streaming to show first token early. Cache deterministic responses (Redis, 24hr TTL). Use cheaper/faster models on non-critical paths. Pre-warm connections. Track P95, not average — tail latency is what users experience.
  4. Q: How do you prevent prompt injection at scale?
    A: Four layers: (1) Input validation — reject inputs containing instruction-like patterns. (2) Structural separation — use roles correctly, never concatenate user input into the system message. (3) Output filtering — check response against expected format/topics. (4) Audit logging — log every call with request_id; flag responses that match injection patterns for human review.
  5. Q: What metrics do you use to evaluate an AI system?
    A: For RAG: RAGAS — faithfulness, answer relevance, context precision, context recall. For generation quality: human eval (sampled) + automated eval using a judge LLM. For production: latency (P95), cost per query, cache hit rate, error rate, user satisfaction (thumbs up/down). Always have a golden test set with known answers for regression testing.
Your Turn — Exercises
  1. Using the 5-step framework, write out a complete design for a customer support chatbot for an e-commerce company. Cover all 5 steps. Time yourself — aim to produce a complete design in 20 minutes (interview pace).
  2. Build the cost model for your design: how much would it cost per month at 10,000 queries/day? At what point would you switch from gpt-4o-mini to gemini-1.5-flash?
  3. Draw the architecture diagram for your design as a whiteboard diagram (pen and paper or Excalidraw). Practice explaining each component aloud — "This is the indexing pipeline, which runs nightly to..."
CP-10 Summary
  • Apply the 5-step framework: Requirements → Retrieval Strategy → Query Pipeline → Evaluation → Scaling
  • Default to RAG for knowledge-heavy tasks; fine-tune only for behaviour/style with labelled data
  • Hybrid search (BM25 + vectors + re-ranker) consistently outperforms pure vector search in production
  • Build a latency budget before choosing components — streaming buys most of your perceived speed
  • Model cost at target scale before committing to a model — the difference is often 10–17x
  • Always include a safety layer in your design: prompt injection, access control, output filtering
  • Interviewers reward: requirements-first thinking, RAG vs fine-tune reasoning, cost modelling, evaluation strategy