Contents
Topic 01
The ML System Design Framework
ML system design interviews differ from regular system design interviews: beyond load balancers and databases, you must reason about the ML objective, the data pipeline, feature engineering, model choice, serving, and monitoring.
| Phase | Key questions to answer |
|---|---|
| 1. Problem Framing | What is the ML task? What metric optimizes business value? What are the constraints (latency, cost, fairness)? |
| 2. Data | What data is available? Volume? Labels? Freshness requirements? Historical vs. real-time? |
| 3. Features | What signals predict the target? User features, item features, context features, interaction features? |
| 4. Model | Simple baseline first (logistic regression). When to use neural nets? Tradeoff: accuracy vs. interpretability vs. latency |
| 5. Serving | Batch vs. real-time? Latency requirements? Scale (QPS)? Caching strategy? |
| 6. Monitoring | What metrics to track? Data drift? Model degradation? How to trigger retraining? |
Always clarify what you are optimizing for: clicks (engagement), conversions (revenue), or retention (long-term value). Each implies a different ML objective. A click model maximizes CTR but may surface clickbait, so always discuss the tension between the proxy metric and the business metric.
Topic 02
Recommendation Systems (YouTube/Netflix Scale)
The recommendation problem: from millions of items, show each user the 10 most relevant ones, in real time, for millions of concurrent users. This is solved with a two-stage architecture:
Two-stage recommendation: retrieve candidates fast, then rank them precisely
The two-tower model for candidate retrieval
The two-tower model encodes users and items independently into embeddings, then uses dot product for similarity. At serving time: pre-compute all item embeddings offline → build ANN index → at query time, encode user → find nearest item embeddings in milliseconds.
- User tower: user_id, age, watch history, device type → 256-dim embedding
- Item tower: item_id, category, creator, tags → 256-dim embedding
- Training signal: Positive = watched; Negative = shown but not watched (or randomly sampled items)
- Serving: Faiss / ScaNN for approximate nearest neighbor search
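A minimal sketch of the serving-time retrieval step. Random vectors stand in for the trained tower outputs, and brute-force search stands in for the Faiss/ScaNN ANN index; the dimensions and item count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # embedding size from the bullets above

# Stand-in for pre-computed item embeddings: in practice the item tower
# (a neural net over item_id, category, creator, tags) produces these
# offline, and they are loaded into an ANN index.
item_embeddings = rng.standard_normal((10_000, DIM)).astype(np.float32)
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def retrieve_top_k(user_embedding: np.ndarray, k: int = 10) -> np.ndarray:
    """Nearest-neighbour search by dot product (brute force here;
    Faiss/ScaNN make this approximate but millisecond-fast at scale)."""
    scores = item_embeddings @ user_embedding      # (num_items,)
    return np.argpartition(-scores, k)[:k]         # indices of top-k items

# At query time: encode the user, then look up nearest item embeddings.
user_emb = rng.standard_normal(DIM).astype(np.float32)
user_emb /= np.linalg.norm(user_emb)
candidates = retrieve_top_k(user_emb, k=10)
```

Because the two towers never interact until the final dot product, item embeddings can be computed entirely offline; only the user encoding happens per request.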
Topic 03
Search & Ranking Systems
- Query understanding: Spell correction → query expansion → intent classification (navigational vs informational vs transactional)
- Retrieval: TF-IDF / BM25 for sparse retrieval; dense retrieval (bi-encoder) for semantic search; hybrid = both
- Re-ranking: Cross-encoder scores query-document pairs more accurately than a bi-encoder, but is too slow to run over all candidates, so it is applied only to the top retrieved results
- Learning to Rank (LTR): Pointwise (predict relevance score), Pairwise (rank A above B), Listwise (optimize NDCG directly)
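The listwise objective above targets NDCG. A minimal sketch of how NDCG is computed, using hypothetical graded relevance labels (3 = perfect, 0 = irrelevant):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: exponential gain on graded relevance,
    # log2 discount on position (positions are 0-indexed here).
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG = DCG of the model's ranking / DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the documents a model placed at positions 1..5:
score = ndcg([3, 2, 3, 0, 1])
```

A perfect ranking scores 1.0; swapping a highly relevant document down the list lowers the score more the higher up the swap happens, which is why NDCG is the standard listwise target.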
User searches "senior ML engineer remote." System must: understand "senior" = seniority filter, "ML engineer" = job category, "remote" = location filter; retrieve from millions of job postings; rank by predicted application probability using user's profile + job features + contextual signals (time since posted, company size, salary range).
Topic 04
Real-time ML: Latency Matters
Feature freshness: a fraud model that uses "user's purchase behavior in the last 5 minutes" is much more accurate than one using "user's 30-day history." But computing real-time features requires a streaming infrastructure.
| Feature Type | Freshness | Computed with | Example |
|---|---|---|---|
| Batch | Hours/days old | Spark jobs, dbt | User 30-day average spend |
| Near-real-time | Minutes old | Flink, Kafka Streams | User's last 10 transactions this hour |
| Real-time | Milliseconds | In-request computation | Transaction amount vs. account balance |
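The three freshness tiers in the table are typically combined in a single inference request. A sketch, where `online_store` is a hypothetical dict-like online feature store (e.g. backed by Redis) and the feature names are illustrative:

```python
def assemble_features(request, online_store):
    """Combine features of different freshness at inference time."""
    user = request["user_id"]
    return {
        # Batch: written by a nightly Spark/dbt job, hours old by now.
        "avg_spend_30d": online_store.get((user, "avg_spend_30d"), 0.0),
        # Near-real-time: maintained by a Flink/Kafka Streams job.
        "txn_count_1h": online_store.get((user, "txn_count_1h"), 0),
        # Real-time: derivable only from the request itself, computed
        # in-process in microseconds.
        "amount_to_balance_ratio":
            request["amount"] / max(request["account_balance"], 1.0),
    }

store = {("u1", "avg_spend_30d"): 42.5, ("u1", "txn_count_1h"): 3}
feats = assemble_features(
    {"user_id": "u1", "amount": 250.0, "account_balance": 500.0}, store)
```

Note the defaults on the store lookups: a fraud model must still return a score when a pipeline is lagging, so missing-feature behavior is part of the design.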
Topic 05
Feature Stores
The feature store is the critical infrastructure piece that prevents training-serving skew: the bug where features computed differently during training and serving cause the model to perform worse in production than on the test set.
- Offline store: Historical feature values (S3, BigQuery) — for training dataset generation
- Online store: Latest feature values (Redis, DynamoDB) — for real-time inference in <10ms
- Feature pipeline: The same code computes features for both stores, eliminating skew from divergent feature logic
- Tools: Feast (open-source), Tecton, Hopsworks, SageMaker Feature Store
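A toy illustration of the "same code for both stores" idea: one feature definition is called by both the offline training path and the online serving path, so the transformation cannot drift between them. The function name and dates are hypothetical:

```python
from datetime import datetime

def days_since_signup(signup_at: datetime, as_of: datetime) -> int:
    """Single shared feature definition.

    Offline, `as_of` is the historical label timestamp, which keeps the
    training data point-in-time correct; online, it is the current
    request time. Both paths call this same function.
    """
    return (as_of - signup_at).days

signup = datetime(2024, 1, 1)
# Offline path: feature value as of when the training label was observed.
train_value = days_since_signup(signup, datetime(2024, 3, 1))
# Online path: feature value at (a later moment on) request day.
serve_value = days_since_signup(signup, datetime(2024, 3, 1, 12, 30))
```

When the two paths instead reimplement the feature separately (SQL offline, application code online), subtle differences in rounding, time zones, or null handling are exactly what produces skew.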
Topic 06
A/B Testing ML Models
- Shadow mode: New model runs in parallel with old, predictions logged but not shown to users. Zero risk, but can't measure user behavior changes
- Canary deployment: Send 5% of traffic to new model. Monitor key metrics before full rollout
- A/B test: 50/50 split; run until the pre-computed sample size is reached, then apply a statistical significance test
- Multi-armed bandit: Dynamically allocate traffic to better-performing models without waiting for statistical significance — good for fast-moving metrics
- Interleaving: Both models' recommendations are mixed for the same user, then track which model's items they click. More sensitive than traditional A/B testing
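The significance test in the A/B bullet is often a two-proportion z-test on a conversion metric. A minimal sketch with hypothetical traffic numbers:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B test on a binary metric.

    Returns the z statistic; under the normal approximation, |z| > 1.96
    corresponds to p < 0.05 (two-sided).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical result: control converts at 5.0%, new model at 5.5%,
# 100k users in each arm.
z = two_proportion_z(5000, 100_000, 5500, 100_000)
significant = abs(z) > 1.96
```

This is also why the sample size must be fixed up front: repeatedly checking z as data arrives ("peeking") inflates the false-positive rate, which is one motivation for the bandit and interleaving alternatives above.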
In ML system design interviews, the process matters as much as the answer. Start by clarifying requirements and metrics. Draw the architecture diagram early. Explicitly state tradeoffs: "We could use a two-tower model for better recall, but a simpler collaborative filtering approach would be easier to debug and deploy faster." Interviewers want to see you reason about tradeoffs, not just recite architectures.