Topic 01

The ML System Design Framework

ML system design interviews differ from traditional system design interviews: beyond load balancers and databases, you must reason about the ML objective, the data pipeline, feature engineering, model choice, serving, and monitoring.

Phase               Key questions to answer
1. Problem Framing  What is the ML task? What metric optimizes business value? What are the constraints (latency, cost, fairness)?
2. Data             What data is available? Volume? Labels? Freshness requirements? Historical vs. real-time?
3. Features         What signals predict the target? User features, item features, context features, interaction features?
4. Model            Simple baseline first (logistic regression). When to use neural nets? Tradeoff: accuracy vs. interpretability vs. latency
5. Serving          Batch vs. real-time? Latency requirements? Scale (QPS)? Caching strategy?
6. Monitoring       What metrics to track? Data drift? Model degradation? How to trigger retraining?
Start with the business metric

Always clarify: are we optimizing for clicks (engagement), conversions (revenue), or retention (long-term value)? These imply different ML objectives. A click model maximizes CTR but may surface clickbait, so discuss the tension between the proxy metric and the true business metric.

Topic 02

Recommendation Systems (YouTube/Netflix Scale)

The recommendation problem: from millions of items, show each user the 10 most relevant ones, in real time, for millions of concurrent users. This is solved with a two-stage architecture:

Stage 1, candidate generation (a recall problem; ANN / two-tower): 1M items → ~500 candidates
Stage 2, ranking (a precision problem; deep neural net): ~500 candidates → top 10
Output: top 10 personalized items

Two-stage recommendation: retrieve candidates fast, then rank them precisely
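The two-stage funnel can be sketched as a tiny pipeline. This is an illustrative skeleton, not a production design: `recommend`, `retrieve`, and `rank` are hypothetical names, and the toy retrieval/ranking functions stand in for an ANN index and a neural ranker.

```python
from typing import Callable

def recommend(user_id: str,
              retrieve: Callable[[str, int], list[str]],
              rank: Callable[[str, list[str]], list[str]],
              n_candidates: int = 500,
              n_results: int = 10) -> list[str]:
    # Stage 1: cheap, high-recall retrieval (millions -> ~500).
    candidates = retrieve(user_id, n_candidates)
    # Stage 2: expensive, high-precision ranking (~500 -> top 10).
    return rank(user_id, candidates)[:n_results]

# Toy usage: retrieval returns item ids, ranking sorts by a score table.
items = [f"item_{i}" for i in range(1000)]
score = {it: i % 97 for i, it in enumerate(items)}
top = recommend("u1",
                retrieve=lambda u, k: items[:k],
                rank=lambda u, cands: sorted(cands, key=lambda c: -score[c]))
```

The key design point is that the two stages have different objectives: stage 1 must never be too slow to scan the full catalog, while stage 2 can afford a heavier model because it only sees a few hundred candidates.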


The two-tower model for candidate retrieval

The two-tower model encodes users and items independently into embeddings, then uses dot product for similarity. At serving time: pre-compute all item embeddings offline → build ANN index → at query time, encode user → find nearest item embeddings in milliseconds.

  • User tower: user_id, age, watch history, device type → 256-dim embedding
  • Item tower: item_id, category, creator, tags → 256-dim embedding
  • Training signal: Positive = watched, Negative = shown but not clicked
  • Serving: Faiss / ScaNN for approximate nearest neighbor search
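A minimal NumPy sketch of the serving path described above, under simplifying assumptions: each tower is a single random linear layer (real towers are trained deep networks), and brute-force dot products stand in for a Faiss/ScaNN ANN index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: raw feature vectors -> 256-dim embeddings.
D_USER, D_ITEM, D_EMB = 32, 32, 256
W_user = rng.normal(size=(D_USER, D_EMB))   # "user tower" (one linear layer here)
W_item = rng.normal(size=(D_ITEM, D_EMB))   # "item tower"

def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    e = x @ W
    # Unit-normalize so dot product equals cosine similarity.
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Offline: pre-compute all item embeddings (this is what gets indexed).
item_feats = rng.normal(size=(10_000, D_ITEM))
item_emb = embed(item_feats, W_item)

# Online: encode the user, then find nearest item embeddings.
# Brute force here; Faiss/ScaNN would make this approximate and fast.
user_emb = embed(rng.normal(size=D_USER), W_user)
scores = item_emb @ user_emb
top_k = np.argsort(-scores)[:500]
```

Because the towers never interact until the final dot product, item embeddings can be computed entirely offline; only the user encoding happens per request.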
Topic 03

Search Ranking

  • Query understanding: Spell correction → query expansion → intent classification (navigational vs informational vs transactional)
  • Retrieval: TF-IDF / BM25 for sparse retrieval; dense retrieval (bi-encoder) for semantic search; hybrid = both
  • Re-ranking: Cross-encoder scores query-document pairs more accurately but slower
  • Learning to Rank (LTR): Pointwise (predict relevance score), Pairwise (rank A above B), Listwise (optimize NDCG directly)
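Sparse retrieval with BM25 is small enough to write out. Below is a self-contained Okapi BM25 scorer over pre-tokenized documents, with the standard parameter defaults (k1 = 1.5, b = 0.75); the tiny corpus is purely illustrative.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["ml", "engineer", "remote"], ["sales", "manager"], ["ml", "intern"]]
s = bm25_scores(["ml", "engineer"], docs)
```

Note how the IDF term rewards rarer query words ("engineer" matters more than "ml" here), which is exactly what TF-IDF-family sparse retrieval buys you before any learned model runs.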
LinkedIn job search example

User searches "senior ML engineer remote." The system must: understand that "senior" is a seniority filter, "ML engineer" a job category, and "remote" a location filter; retrieve from millions of job postings; and rank by predicted application probability using the user's profile, job features, and contextual signals (time since posting, company size, salary range).
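The query-understanding step above can be sketched with simple rules. This is a hypothetical toy parser (a production system would use trained intent and entity models); the vocabulary sets and the `parse_query` name are assumptions for illustration.

```python
# Rule-based stand-in for query understanding in job search.
SENIORITY = {"junior", "senior", "staff", "principal"}
LOCATIONS = {"remote", "hybrid", "onsite"}

def parse_query(query: str) -> dict:
    tokens = query.lower().split()
    filters = {
        "seniority": [t for t in tokens if t in SENIORITY],
        "location": [t for t in tokens if t in LOCATIONS],
    }
    # Whatever is left becomes the job-category text used for retrieval.
    rest = [t for t in tokens if t not in SENIORITY and t not in LOCATIONS]
    return {"filters": filters, "category_terms": rest}

parsed = parse_query("senior ML engineer remote")
# parsed: {"filters": {"seniority": ["senior"], "location": ["remote"]},
#          "category_terms": ["ml", "engineer"]}
```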

Topic 04

Real-time ML: Latency Matters

Feature freshness: a fraud model that uses "user's purchase behavior in the last 5 minutes" is much more accurate than one using "user's 30-day history." But computing real-time features requires a streaming infrastructure.

Feature Type     Freshness       Computed with           Example
Batch            Hours/days old  Spark jobs, dbt         User 30-day average spend
Near-real-time   Minutes old     Flink, Kafka Streams    User's last 10 transactions this hour
Real-time        Milliseconds    In-request computation  Transaction amount vs. account balance
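The three freshness tiers can coexist in one fraud feature vector. A minimal sketch, assuming hypothetical field and function names: the batch feature arrives precomputed, the near-real-time feature is derived from a small streaming buffer, and the real-time feature is computed inside the request.

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AccountState:
    balance: float
    avg_spend_30d: float  # batch feature, refreshed by a daily job
    # Streaming buffer of recent transactions (near-real-time tier).
    recent_txns: deque = field(default_factory=lambda: deque(maxlen=10))

def fraud_features(state: AccountState, txn_amount: float) -> dict:
    now = time.time()
    return {
        "avg_spend_30d": state.avg_spend_30d,                       # batch
        "txns_last_hour": sum(1 for t in state.recent_txns          # near-real-time
                              if t["ts"] > now - 3600),
        "amount_vs_balance": txn_amount / max(state.balance, 1.0),  # real-time, in-request
    }

state = AccountState(balance=200.0, avg_spend_30d=45.0)
state.recent_txns.append({"ts": time.time(), "amount": 30.0})
feats = fraud_features(state, txn_amount=150.0)
```

Only the last feature needs zero infrastructure; the other two are exactly what the batch and streaming pipelines in the table exist to maintain.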

Topic 05

Feature Stores

The feature store is the critical infrastructure piece that prevents training-serving skew: the bug where features are computed differently during training than during serving, so the model performs worse in production than on the test set.

  • Offline store: Historical feature values (S3, BigQuery) — for training dataset generation
  • Online store: Latest feature values (Redis, DynamoDB) — for real-time inference in <10ms
  • Feature pipeline: Same code computes features for both stores — no skew possible
  • Tools: Feast (open-source), Tecton, Hopsworks, SageMaker Feature Store
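The "same code, both stores" idea can be shown in a few lines. A minimal sketch, with dicts standing in for the real backends (S3/BigQuery offline, Redis/DynamoDB online) and hypothetical function names:

```python
def compute_features(raw: dict) -> dict:
    """Single feature definition used by BOTH training and serving paths.
    Sharing this function is what rules out training-serving skew."""
    return {"avg_order_value": raw["total_spend"] / max(raw["num_orders"], 1)}

offline_store: dict[str, list[dict]] = {}   # append-only history, for training sets
online_store: dict[str, dict] = {}          # latest value only, for inference

def ingest(user_id: str, raw: dict) -> None:
    feats = compute_features(raw)
    offline_store.setdefault(user_id, []).append(feats)  # keep full history
    online_store[user_id] = feats                        # overwrite with freshest

ingest("u1", {"total_spend": 300.0, "num_orders": 3})
ingest("u1", {"total_spend": 500.0, "num_orders": 4})
```

Training reads the full history from the offline store (point-in-time correct joins in a real system); inference reads a single fresh row from the online store.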

Topic 06

A/B Testing ML Models

  • Shadow mode: New model runs in parallel with old, predictions logged but not shown to users. Zero risk, but can't measure user behavior changes
  • Canary deployment: Send 5% of traffic to new model. Monitor key metrics before full rollout
  • A/B test: 50/50 split, statistical significance test, run until sample size achieved
  • Multi-armed bandit: Dynamically allocate traffic to better-performing models without waiting for statistical significance — good for fast-moving metrics
  • Interleaving: Both models' recommendations are mixed for the same user, then track which model's items they click. More sensitive than traditional A/B testing
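The multi-armed bandit option can be made concrete with Thompson sampling over click/no-click feedback. This is a toy simulation, not a traffic-routing system: the class name, the simulated CTRs, and the round count are all illustrative assumptions.

```python
import random

random.seed(0)

class ThompsonBandit:
    """Bernoulli Thompson sampling: Beta(1, 1) prior on each arm's CTR."""
    def __init__(self, arms: list[str]):
        self.stats = {a: [1, 1] for a in arms}   # [alpha, beta] per arm

    def choose(self) -> str:
        # Sample a plausible CTR for each arm; play the best sample.
        return max(self.stats, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm: str, clicked: bool) -> None:
        self.stats[arm][0 if clicked else 1] += 1

bandit = ThompsonBandit(["old_model", "new_model"])
true_ctr = {"old_model": 0.05, "new_model": 0.10}   # hypothetical ground truth
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_ctr[arm])
```

As the posterior for the better arm concentrates, the bandit automatically shifts traffic toward it, which is the "no waiting for significance" property the bullet describes; the cost is that you lose the clean unbiased comparison a fixed A/B split gives you.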
Interview tip: think out loud

In ML system design interviews, the process matters as much as the answer. Start by clarifying requirements and metrics. Draw the architecture diagram early. Explicitly state tradeoffs: "We could use a two-tower model for better recall, but a simpler collaborative filtering approach would be easier to debug and deploy faster." Interviewers want to see you reason about tradeoffs, not just recite architectures.