Topic 01

The ML System Design Framework

ML system design interviews differ from traditional system design interviews: beyond load balancers and databases, you must reason about the ML objective, the data pipeline, feature engineering, model choice, serving, and monitoring.

Phase               Key questions to answer
1. Problem Framing  What is the ML task? What metric optimizes business value? What are the constraints (latency, cost, fairness)?
2. Data             What data is available? Volume? Labels? Freshness requirements? Historical vs. real-time?
3. Features         What signals predict the target? User features, item features, context features, interaction features?
4. Model            Simple baseline first (logistic regression). When to use neural nets? Tradeoff: accuracy vs. interpretability vs. latency
5. Serving          Batch vs. real-time? Latency requirements? Scale (QPS)? Caching strategy?
6. Monitoring       What metrics to track? Data drift? Model degradation? How to trigger retraining?
Start with the business metric

Always clarify: are we optimizing for clicks (engagement), conversions (revenue), or retention (long-term value)? These imply different ML objectives. A click model maximizes CTR but may surface clickbait, so discuss the tension between the proxy metric and the true business metric.

Topic 02

Recommendation Systems (YouTube/Netflix Scale)

The recommendation problem: from millions of items, show each user the 10 most relevant ones, in real time, for millions of concurrent users. This is solved with a two-stage architecture:

Stage 1, candidate generation (a recall problem; ANN / two-tower): 1M items → ~500 candidates
Stage 2, ranking (a precision problem; deep neural net): ~500 candidates → top 10
Output: top 10 personalized items

Two-stage recommendation: retrieve candidates fast, then rank them precisely
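The two-stage funnel can be sketched as a tiny pipeline. This is an illustrative skeleton, not a production design: `recommend`, `retrieve`, and `rank` are hypothetical names, and the toy retrieval/ranking functions stand in for an ANN index and a neural ranker.

```python
from typing import Callable

def recommend(user_id: str,
              retrieve: Callable[[str, int], list[str]],
              rank: Callable[[str, list[str]], list[str]],
              n_candidates: int = 500,
              n_results: int = 10) -> list[str]:
    # Stage 1: cheap, high-recall retrieval (millions -> ~500).
    candidates = retrieve(user_id, n_candidates)
    # Stage 2: expensive, high-precision ranking (~500 -> top 10).
    return rank(user_id, candidates)[:n_results]

# Toy usage: retrieval returns item ids, ranking sorts by a score table.
items = [f"item_{i}" for i in range(1000)]
score = {it: i % 97 for i, it in enumerate(items)}
top = recommend("u1",
                retrieve=lambda u, k: items[:k],
                rank=lambda u, cands: sorted(cands, key=lambda c: -score[c]))
```

The key design point is that the two stages have different objectives: stage 1 must never be too slow to scan the full catalog, while stage 2 can afford a heavier model because it only sees a few hundred candidates.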


The two-tower model for candidate retrieval

The two-tower model encodes users and items independently into embeddings, then uses dot product for similarity. At serving time: pre-compute all item embeddings offline → build ANN index → at query time, encode user → find nearest item embeddings in milliseconds.

  • User tower: user_id, age, watch history, device type → 256-dim embedding
  • Item tower: item_id, category, creator, tags → 256-dim embedding
  • Training signal: Positive = watched, Negative = shown but not clicked
  • Serving: Faiss / ScaNN for approximate nearest neighbor search
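A minimal NumPy sketch of the serving path described above, under simplifying assumptions: each tower is a single random linear layer (real towers are trained deep networks), and brute-force dot products stand in for a Faiss/ScaNN ANN index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: raw feature vectors -> 256-dim embeddings.
D_USER, D_ITEM, D_EMB = 32, 32, 256
W_user = rng.normal(size=(D_USER, D_EMB))   # "user tower" (one linear layer here)
W_item = rng.normal(size=(D_ITEM, D_EMB))   # "item tower"

def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    e = x @ W
    # Unit-normalize so dot product equals cosine similarity.
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Offline: pre-compute all item embeddings (this is what gets indexed).
item_feats = rng.normal(size=(10_000, D_ITEM))
item_emb = embed(item_feats, W_item)

# Online: encode the user, then find nearest item embeddings.
# Brute force here; Faiss/ScaNN would make this approximate and fast.
user_emb = embed(rng.normal(size=D_USER), W_user)
scores = item_emb @ user_emb
top_k = np.argsort(-scores)[:500]
```

Because the towers never interact until the final dot product, item embeddings can be computed entirely offline; only the user encoding happens per request.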
Topic 03

Search Ranking

  • Query understanding: Spell correction → query expansion → intent classification (navigational vs informational vs transactional)
  • Retrieval: TF-IDF / BM25 for sparse retrieval; dense retrieval (bi-encoder) for semantic search; hybrid = both
  • Re-ranking: Cross-encoder scores query-document pairs more accurately but slower
  • Learning to Rank (LTR): Pointwise (predict relevance score), Pairwise (rank A above B), Listwise (optimize NDCG directly)
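Sparse retrieval with BM25 is small enough to write out. Below is a self-contained Okapi BM25 scorer over pre-tokenized documents, with the standard parameter defaults (k1 = 1.5, b = 0.75); the tiny corpus is purely illustrative.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["ml", "engineer", "remote"], ["sales", "manager"], ["ml", "intern"]]
s = bm25_scores(["ml", "engineer"], docs)
```

Note how the IDF term rewards rarer query words ("engineer" matters more than "ml" here), which is exactly what TF-IDF-family sparse retrieval buys you before any learned model runs.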
LinkedIn job search example

User searches "senior ML engineer remote." The system must: understand that "senior" is a seniority filter, "ML engineer" a job category, and "remote" a location filter; retrieve from millions of job postings; and rank by predicted application probability using the user's profile, job features, and contextual signals (time since posting, company size, salary range).
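The query-understanding step above can be sketched with simple rules. This is a hypothetical toy parser (a production system would use trained intent and entity models); the vocabulary sets and the `parse_query` name are assumptions for illustration.

```python
# Rule-based stand-in for query understanding in job search.
SENIORITY = {"junior", "senior", "staff", "principal"}
LOCATIONS = {"remote", "hybrid", "onsite"}

def parse_query(query: str) -> dict:
    tokens = query.lower().split()
    filters = {
        "seniority": [t for t in tokens if t in SENIORITY],
        "location": [t for t in tokens if t in LOCATIONS],
    }
    # Whatever is left becomes the job-category text used for retrieval.
    rest = [t for t in tokens if t not in SENIORITY and t not in LOCATIONS]
    return {"filters": filters, "category_terms": rest}

parsed = parse_query("senior ML engineer remote")
# parsed: {"filters": {"seniority": ["senior"], "location": ["remote"]},
#          "category_terms": ["ml", "engineer"]}
```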

Topic 04

Real-time ML: Latency Matters

Feature freshness: a fraud model that uses "user's purchase behavior in the last 5 minutes" is much more accurate than one using "user's 30-day history." But computing real-time features requires a streaming infrastructure.

Feature Type     Freshness       Computed with           Example
Batch            Hours/days old  Spark jobs, dbt         User 30-day average spend
Near-real-time   Minutes old     Flink, Kafka Streams    User's last 10 transactions this hour
Real-time        Milliseconds    In-request computation  Transaction amount vs. account balance
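The three freshness tiers can coexist in one fraud feature vector. A minimal sketch, assuming hypothetical field and function names: the batch feature arrives precomputed, the near-real-time feature is derived from a small streaming buffer, and the real-time feature is computed inside the request.

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AccountState:
    balance: float
    avg_spend_30d: float  # batch feature, refreshed by a daily job
    # Streaming buffer of recent transactions (near-real-time tier).
    recent_txns: deque = field(default_factory=lambda: deque(maxlen=10))

def fraud_features(state: AccountState, txn_amount: float) -> dict:
    now = time.time()
    return {
        "avg_spend_30d": state.avg_spend_30d,                       # batch
        "txns_last_hour": sum(1 for t in state.recent_txns          # near-real-time
                              if t["ts"] > now - 3600),
        "amount_vs_balance": txn_amount / max(state.balance, 1.0),  # real-time, in-request
    }

state = AccountState(balance=200.0, avg_spend_30d=45.0)
state.recent_txns.append({"ts": time.time(), "amount": 30.0})
feats = fraud_features(state, txn_amount=150.0)
```

Only the last feature needs zero infrastructure; the other two are exactly what the batch and streaming pipelines in the table exist to maintain.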

Topic 05

Feature Stores

The feature store is the critical infrastructure piece that prevents training-serving skew: the bug where features are computed differently during training than during serving, so the model performs worse in production than on the test set.

  • Offline store: Historical feature values (S3, BigQuery) — for training dataset generation
  • Online store: Latest feature values (Redis, DynamoDB) — for real-time inference in <10ms
  • Feature pipeline: Same code computes features for both stores — no skew possible
  • Tools: Feast (open-source), Tecton, Hopsworks, SageMaker Feature Store
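The "same code, both stores" idea can be shown in a few lines. A minimal sketch, with dicts standing in for the real backends (S3/BigQuery offline, Redis/DynamoDB online) and hypothetical function names:

```python
def compute_features(raw: dict) -> dict:
    """Single feature definition used by BOTH training and serving paths.
    Sharing this function is what rules out training-serving skew."""
    return {"avg_order_value": raw["total_spend"] / max(raw["num_orders"], 1)}

offline_store: dict[str, list[dict]] = {}   # append-only history, for training sets
online_store: dict[str, dict] = {}          # latest value only, for inference

def ingest(user_id: str, raw: dict) -> None:
    feats = compute_features(raw)
    offline_store.setdefault(user_id, []).append(feats)  # keep full history
    online_store[user_id] = feats                        # overwrite with freshest

ingest("u1", {"total_spend": 300.0, "num_orders": 3})
ingest("u1", {"total_spend": 500.0, "num_orders": 4})
```

Training reads the full history from the offline store (point-in-time correct joins in a real system); inference reads a single fresh row from the online store.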

Topic 06

A/B Testing ML Models

  • Shadow mode: New model runs in parallel with old, predictions logged but not shown to users. Zero risk, but can't measure user behavior changes
  • Canary deployment: Send 5% of traffic to new model. Monitor key metrics before full rollout
  • A/B test: 50/50 split, statistical significance test, run until sample size achieved
  • Multi-armed bandit: Dynamically allocate traffic to better-performing models without waiting for statistical significance — good for fast-moving metrics
  • Interleaving: Both models' recommendations are mixed for the same user, then track which model's items they click. More sensitive than traditional A/B testing
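The multi-armed bandit option can be made concrete with Thompson sampling over click/no-click feedback. This is a toy simulation, not a traffic-routing system: the class name, the simulated CTRs, and the round count are all illustrative assumptions.

```python
import random

random.seed(0)

class ThompsonBandit:
    """Bernoulli Thompson sampling: Beta(1, 1) prior on each arm's CTR."""
    def __init__(self, arms: list[str]):
        self.stats = {a: [1, 1] for a in arms}   # [alpha, beta] per arm

    def choose(self) -> str:
        # Sample a plausible CTR for each arm; play the best sample.
        return max(self.stats, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm: str, clicked: bool) -> None:
        self.stats[arm][0 if clicked else 1] += 1

bandit = ThompsonBandit(["old_model", "new_model"])
true_ctr = {"old_model": 0.05, "new_model": 0.10}   # hypothetical ground truth
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_ctr[arm])
```

As the posterior for the better arm concentrates, the bandit automatically shifts traffic toward it, which is the "no waiting for significance" property the bullet describes; the cost is that you lose the clean unbiased comparison a fixed A/B split gives you.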
Interview tip: think out loud

In ML system design interviews, the process matters as much as the answer. Start by clarifying requirements and metrics. Draw the architecture diagram early. Explicitly state tradeoffs: "We could use a two-tower model for better recall, but a simpler collaborative filtering approach would be easier to debug and deploy faster." Interviewers want to see you reason about tradeoffs, not just recite architectures.