By the end of this checkpoint, you'll have a repeatable 5-step framework you can apply to any AI system design question in an interview. You'll also have complete design blueprints for the two most common questions: enterprise RAG chatbot and code completion system.
Concept 01
The 5-Step Framework — Apply to Any AI System Design Question
Every AI system design interview question, regardless of the product, reduces to the same five decisions. Memorise this framework and apply it in order. Interviewers reward structured thinking over encyclopaedic knowledge.
| Step | Questions to answer | Why it matters |
|---|---|---|
| 1. Requirements | What queries? What data sources? Latency SLA? Cost budget? Scale? | Every design decision flows from requirements. Don't skip this. |
| 2. Retrieval Strategy | RAG, fine-tuning, or pure prompting? What chunking strategy? What vector DB? | Wrong retrieval = wrong answers. This is the core architectural decision. |
| 3. Query Pipeline | How does a user query flow through the system? What's the prompt assembly strategy? | End-to-end latency, quality, and cost depend on query pipeline design. |
| 4. Evaluation | How do you measure quality? What metrics? How do you catch regressions? | Without evaluation, you can't improve. Most candidates skip this — stand out by including it. |
| 5. Scaling & Operations | What breaks at 10x load? How do you update knowledge? Cost at scale? | Shows production thinking, not just architecture drawing. |
"Before I start drawing components, let me clarify requirements. Is this an internal enterprise tool or public-facing? What's the latency expectation — under 2 seconds for the first token? How many documents are in the knowledge base, and how often does it update? What's the monthly cost budget? ..."
Spending 2 minutes on requirements signals seniority. Junior candidates jump straight to boxes-and-arrows diagrams.
Concept 02
RAG vs Fine-Tuning vs Prompting — The Decision Tree
The single most important architectural decision in an AI system is how you give the model knowledge. Three options exist, and each has a distinct use case.
| | RAG | Fine-tuning | Prompting only |
|---|---|---|---|
| Best for | Private/proprietary knowledge, frequently updated data | Specific style, tone, or behaviour change; domain-specific reasoning | General tasks the base model already handles well |
| Update cost | Low — update vector store | High — re-train, days + GPU cost | None |
| Hallucination risk | Low (grounded in retrieved docs) | Medium (knowledge baked in weights, can drift) | High for factual queries |
| Latency overhead | +100–300ms for retrieval | None (faster inference on smaller model) | None |
| Cost | Embedding + vector DB storage + retrieval per query | GPU training (thousands of dollars) + deployment | Just inference cost |
| Example use case | Company policy chatbot, customer support over product docs | Code completion model for a specific language/codebase style | Summarisation, translation, general Q&A |
Default to RAG for knowledge-heavy tasks. Use fine-tuning only when you need behaviour/style change and have a labelled dataset of 1,000+ examples. Use prompting-only when the base model's knowledge is sufficient. Hybrid (RAG + fine-tuned retriever) for the highest-quality production systems.
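The decision tree above can be sketched as a small heuristic. This is illustrative only — the `TaskProfile` fields and thresholds are assumptions mirroring the table, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    needs_private_knowledge: bool   # proprietary or frequently updated data?
    needs_style_change: bool        # tone/format/behaviour the base model lacks?
    labelled_examples: int          # size of any labelled dataset


def choose_approach(task: TaskProfile) -> str:
    """Mirror the decision tree: default to RAG for knowledge, fine-tune for behaviour."""
    if (task.needs_private_knowledge and task.needs_style_change
            and task.labelled_examples >= 1000):
        return "hybrid: RAG + fine-tuning"
    if task.needs_private_knowledge:
        return "RAG"
    if task.needs_style_change and task.labelled_examples >= 1000:
        return "fine-tuning"
    return "prompting only"


print(choose_approach(TaskProfile(True, False, 0)))  # RAG
```

In an interview, walking through a check like this out loud (knowledge first, then behaviour, then dataset size) shows you reason from requirements rather than defaulting to a favourite technique.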
Concept 03
Designing the Retrieval Layer — The Details That Separate Good from Great
The retrieval layer is where most RAG systems fail in practice. Good retrieval design requires answering five sub-questions:
1. Chunking Strategy
| Strategy | How it works | Use when |
|---|---|---|
| Fixed-size | Split every N tokens with M token overlap | Simple documents, consistent structure |
| Sentence-aware | Split on sentence boundaries, batch sentences to hit target size | Prose documents (articles, docs) |
| Recursive / structural | Split on headers, then paragraphs, then sentences | Markdown, HTML, code files with clear structure |
| Semantic | Embed sentences, split where cosine similarity drops | Long documents with clear topic changes |
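A minimal sketch of the first strategy, fixed-size chunking with overlap. To stay dependency-free it approximates "tokens" with whitespace-split words — a real pipeline would use the embedding model's tokenizer, and `size`/`overlap` values here are illustrative.

```python
def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of ~size words (requires size > overlap)."""
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last chunk reached the end; avoid a tiny trailing fragment
    return chunks


doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_fixed(doc, size=300, overlap=50)
print(len(chunks))  # 3 chunks: words 0-299, 250-549, 500-699
```

The overlap matters: without it, a sentence straddling a chunk boundary is split across two chunks and neither embedding captures it well.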
2. Vector Database Choice
| DB | Best for | Managed? |
|---|---|---|
| ChromaDB | Local dev, prototypes, <1M vectors | Self-hosted or in-process |
| Pinecone | Cloud production, millions of vectors, no infra management | Yes (cloud) |
| pgvector | If you already use Postgres — one less system to operate | Via managed Postgres |
| Qdrant | Open-source, high performance, on-prem compliance needs | Self-hosted or cloud |
| Firestore Vector Search | Firebase projects — native integration, free tier available | Yes (Firebase) |
3. Hybrid Search
Pure vector search misses exact matches (product SKUs, names, error codes). Pure keyword search misses synonyms. Hybrid search combines BM25 (keyword) + vector search and re-ranks results with a cross-encoder. Use hybrid search in any production RAG system.
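One common way to merge the two result lists before cross-encoder re-ranking is Reciprocal Rank Fusion (RRF) — a sketch below, with `k=60` as the conventional constant and the document IDs invented for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["sku-123", "doc-a", "doc-b"]     # keyword search: exact-match strengths
vector_hits = ["doc-a", "doc-c", "sku-123"]   # vector search: semantic strengths
print(rrf([bm25_hits, vector_hits])[:3])
```

Documents that appear near the top of both lists (here `doc-a`) win, which is exactly the behaviour you want before handing the top candidates to a cross-encoder.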
Concept 04
Latency Budget for AI Features
Users expect search results in under 200ms and chatbot responses in under 2 seconds for first token. LLM features must fit within these budgets. Break down the latency stack:
| Component | Typical latency | Optimization lever |
|---|---|---|
| Query embedding | 30–80ms | Cache embeddings of common queries |
| Vector search | 10–50ms | HNSW index tuning, ANN vs exact search |
| Context assembly | 5–15ms | Pre-format chunks, avoid re-processing |
| LLM API (time to first token) | 400–1200ms | Streaming, smaller model, response caching |
| Network + serialization | 20–50ms | CDN, keep connections warm |
| Total (no cache) | 465–1395ms | |
| Total (cache hit) | <50ms | Always implement response caching |
Without streaming, users wait for the full response before seeing anything. With streaming, they see the first token within 500ms — perceived as "fast." Always use StreamingResponse for chat interfaces. Buffer the stream server-side if you need to run post-generation filters.
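The server-side buffering idea can be sketched with a plain generator: hold back a small sliding window of tokens so a post-generation filter can drop a blocked token before it reaches the client. The blocked list, window size, and token shapes are illustrative assumptions; a real filter would scan phrases, not single tokens.

```python
from collections import deque
from typing import Iterable, Iterator


def filtered_stream(tokens: Iterable[str], blocked: set[str],
                    window: int = 3) -> Iterator[str]:
    """Yield tokens with a `window`-token delay, dropping any blocked token."""
    buf: deque[str] = deque()
    for tok in tokens:
        buf.append(tok)
        if len(buf) > window:
            out = buf.popleft()
            if out.strip().lower() not in blocked:
                yield out
    while buf:  # flush the remaining buffered tokens at end of stream
        out = buf.popleft()
        if out.strip().lower() not in blocked:
            yield out


stream = ["The ", "answer ", "secret ", "is ", "42"]
print("".join(filtered_stream(stream, blocked={"secret"})))
```

The trade-off is explicit: a larger window catches longer problem spans but adds that many tokens of latency before the first token reaches the user.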
Concept 05
Cost Architecture — Per-Query Economics
At scale, AI costs are not negligible. A system serving 100,000 queries/day can easily cost $3,000–$15,000/month depending on model choice. Design your cost architecture before you deploy.
# Cost model for a RAG chatbot (per query)
#
# Assumptions:
# - Query embedding: 50 tokens input @ text-embedding-3-small ($0.02/1M) = $0.000001
# - Context assembly: 1,500 tokens (5 chunks * 300 tokens each)
# - LLM call: 1,500 input + 500 output @ gpt-4o-mini ($0.15/$0.60 per 1M)
#
# Per-query cost:
embedding_cost = 50 / 1_000_000 * 0.02 # $0.000001
llm_input_cost = 1500 / 1_000_000 * 0.15 # $0.000225
llm_output_cost = 500 / 1_000_000 * 0.60 # $0.000300
total_per_query = embedding_cost + llm_input_cost + llm_output_cost
# = ~$0.000526 per query
# At 100,000 queries/day:
daily_cost = 100_000 * total_per_query # $52.60/day
monthly_cost = daily_cost * 30 # $1,578/month
# With 40% cache hit rate:
effective_monthly = monthly_cost * 0.60 # ~$947/month
# Switch to gemini-1.5-flash ($0.075 input, $0.30 output per 1M)?
# llm_input_cost = 1500/1M * 0.075 = $0.0001125
# llm_output_cost = 500/1M * 0.30 = $0.00015
# total = $0.0002635 per query → $791/month → saves $787/month vs gpt-4o-mini
Always model your cost at target scale before choosing a model. The difference between gpt-4o and gpt-4o-mini is 17x in cost — often with only minor quality difference on constrained tasks.
Concept 06
Security & Safety Layer — What Interviewers Always Ask
Every AI system design answer should include a safety layer. Interviewers specifically probe for this because most candidates omit it.
| Threat | Description | Mitigation |
|---|---|---|
| Prompt injection | User input overrides system prompt | Input validation, separator tokens, output filtering, never trust user input in system role |
| Data exfiltration | User extracts documents they shouldn't see | Row-level security in vector DB, filter retrieved chunks by user permissions before assembly |
| Jailbreaking | User bypasses content restrictions | OpenAI Moderation API, Llama Guard, or custom classifier on both input and output |
| Hallucination as fact | Model states incorrect info confidently | Faithfulness check post-generation (RAGAS), citations with source links, disclaimer for ungrounded answers |
| Denial of Service | Attacker floods expensive LLM endpoints | Rate limiting per user/IP, token budget per request, WAF rules on endpoint |
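The data-exfiltration mitigation deserves a concrete shape: filter retrieved chunks against the user's permissions before context assembly, so restricted text never enters the prompt. A minimal sketch — the `Chunk` fields and department ACL model are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    department: str   # access-control metadata stored alongside the vector


def authorized_chunks(chunks: list[Chunk],
                      user_departments: set[str]) -> list[Chunk]:
    """Keep only chunks the user's departments are allowed to see."""
    return [c for c in chunks if c.department in user_departments]


retrieved = [Chunk("PTO policy...", "hr"), Chunk("M&A memo...", "legal")]
visible = authorized_chunks(retrieved, user_departments={"hr"})
print([c.text for c in visible])  # ['PTO policy...']
```

In production, push this filter into the vector query itself (a metadata filter or row-level security) so restricted chunks are never even fetched — filtering after retrieval is the fallback, not the ideal.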
Concept 07
Design: Enterprise RAG Chatbot — Full Walkthrough
Question: "Design a chatbot that lets employees ask questions about internal company policies and documentation."
Step 1 — Requirements: Internal use, ~5,000 employees, ~10,000 policy documents (PDF/Word/HTML), documents updated monthly, latency <3s for first token, answer should cite source document, access control (employee can only see docs for their department).
Step 2 — Retrieval Strategy: RAG (knowledge is private, updated monthly — fine-tuning would require monthly re-training). Chunking: recursive/structural (documents have sections and headers). Vector DB: pgvector (company already uses Postgres, minimise new infrastructure). Hybrid search: BM25 + vector, re-rank top 10 to top 5.
Step 3 — Query Pipeline: embed query → hybrid search filtered by the user's department → re-rank top 10 to top 5 → assemble context (chunks plus source metadata for citations) → LLM call (gpt-4o-mini, streaming) → output safety check → answer with source citations.
Step 4 — Evaluation: Golden test set of 50 questions with known answers. Metrics: faithfulness (answer grounded in retrieved docs), answer relevance, citation accuracy. Regression gate in CI — any model or prompt change must maintain baseline scores.
Step 5 — Scaling: Vector search stays fast even at ~10M chunks (pgvector handles this with an HNSW index). LLM cost: ~$0.0005/query × 500 queries/day ≈ $0.25/day, or roughly $7.50/month — well within budget. Monthly doc update: incremental re-index of only the changed documents (track a content hash per doc).
Concept 08
Design: Code Completion System — Key Decisions
Question: "Design a GitHub Copilot-like code completion system."
Why this is different from RAG: Code completion has extreme latency requirements (<100ms for inline suggestions), a very specific context assembly problem (surrounding code, not documents), and benefits from fine-tuning on the company's codebase style — not just retrieval.
Key architectural decisions:
- Model choice: Use a code-specific model fine-tuned on your codebase (CodeLlama, DeepSeek Coder) rather than a general LLM — 10x cheaper inference, better code quality for your style.
- Context assembly (Fill-in-the-Middle): Include prefix (code above cursor), suffix (code below cursor), open file imports, and recently edited files from the same session. Not documents.
- Latency: Stream completions. Prefetch suggestions speculatively as the user types. Cache completions for common patterns. Target 50ms time-to-first-token with a local model deployment.
- Speculative decoding: Use a small draft model (1B params) to generate tokens fast, verify with a larger model. Reduces latency 2–3x for common code patterns.
- Feedback loop: Track accept/reject rate per suggestion. Low accept rate → retrain or adjust context window. This is your eval metric.
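The Fill-in-the-Middle context assembly above can be sketched as a prompt builder. The sentinel tokens shown (`<|fim_prefix|>` etc.) follow the common FIM convention, but the exact tokens vary by model — treat them, and the function name, as placeholders rather than any specific model's vocabulary.

```python
def build_fim_prompt(prefix: str, suffix: str,
                     extra_context: list[str]) -> str:
    """Assemble a FIM prompt: shared context, then prefix/suffix around the cursor."""
    # Imports and recently edited files go in front as plain context.
    context = "\n".join(extra_context)
    return (
        f"{context}\n"
        f"<|fim_prefix|>{prefix}"
        f"<|fim_suffix|>{suffix}"
        f"<|fim_middle|>"  # the model generates the completion after this token
    )


prompt = build_fim_prompt(
    prefix="def total(items):\n    return ",
    suffix="\n\nprint(total([1, 2]))",
    extra_context=["import math", "# from recently edited: utils.py"],
)
print(prompt.endswith("<|fim_middle|>"))  # True
```

The key difference from RAG context assembly: the suffix lets the model see code *below* the cursor, so the completion has to splice cleanly into both sides — something a prefix-only prompt cannot enforce.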
When asked to design any AI system, always include: (1) a requirements-first approach, (2) explicit RAG vs fine-tuning reasoning, (3) a latency budget breakdown, (4) per-query cost estimate, (5) a safety layer, and (6) how you'd evaluate quality. Most candidates give you 1–2 of these. Giving all 6 signals a senior engineer who has shipped AI in production.
Practice
Interview Questions for This Checkpoint
- Q: "Design an enterprise RAG chatbot." Walk through your approach.
A: Open with requirements (data sources, latency SLA, access control). Choose RAG (not fine-tuning) — private knowledge, monthly updates. Design indexing pipeline: parse → chunk (recursive) → embed → store (pgvector with dept metadata). Design query pipeline: embed query → hybrid search with dept filter → rerank → assemble context → LLM (gpt-4o-mini, streaming) → output safety check. Include evaluation (faithfulness, citation accuracy) and scaling (cost model, incremental indexing). - Q: How do you decide between RAG and fine-tuning?
A: RAG for factual/knowledge tasks with proprietary or frequently-updated data — easy to update (change the vector store), no training cost, grounded in sources. Fine-tuning for style/behaviour changes, domain-specific reasoning, or when you have 1,000+ labelled examples and need model behaviour change. They're not mutually exclusive — a fine-tuned retriever + RAG generation is often the best production system. - Q: How do you handle latency in an LLM-based system?
A: Build a latency budget: embedding (50ms) + retrieval (30ms) + LLM time-to-first-token (500ms) + network (30ms) = ~610ms. Use streaming to show first token early. Cache deterministic responses (Redis, 24hr TTL). Use cheaper/faster models on non-critical paths. Pre-warm connections. Track P95, not average — tail latency is what users experience. - Q: How do you prevent prompt injection at scale?
A: Four layers: (1) Input validation — reject inputs containing instruction-like patterns. (2) Structural separation — use roles correctly, never concatenate user input into the system message. (3) Output filtering — check response against expected format/topics. (4) Audit logging — log every call with request_id; flag responses that match injection patterns for human review. - Q: What metrics do you use to evaluate an AI system?
A: For RAG: RAGAS — faithfulness, answer relevance, context precision, context recall. For generation quality: human eval (sampled) + automated eval using a judge LLM. For production: latency (P95), cost per query, cache hit rate, error rate, user satisfaction (thumbs up/down). Always have a golden test set with known answers for regression testing.
- Using the 5-step framework, write out a complete design for a customer support chatbot for an e-commerce company. Cover all 5 steps. Time yourself — aim to produce a complete design in 20 minutes (interview pace).
- Build the cost model for your design: how much would it cost per month at 10,000 queries/day? At what point would you switch from gpt-4o-mini to gemini-1.5-flash?
- Draw the architecture diagram for your design as a whiteboard diagram (pen and paper or Excalidraw). Practice explaining each component aloud — "This is the indexing pipeline, which runs nightly to..."
- Apply the 5-step framework: Requirements → Retrieval Strategy → Query Pipeline → Evaluation → Scaling
- Default to RAG for knowledge-heavy tasks; fine-tune only for behaviour/style with labelled data
- Hybrid search (BM25 + vectors + re-ranker) consistently outperforms pure vector search in production
- Build a latency budget before choosing components — streaming buys most of your perceived speed
- Model cost at target scale before committing to a model — the difference is often 10–17x
- Always include a safety layer in your design: prompt injection, access control, output filtering
- Interviewers reward: requirements-first thinking, RAG vs fine-tune reasoning, cost modelling, evaluation strategy