Topic 01

The 4 Interview Types in AI/ML Roles

Type | What they test | How to prepare
ML Theory | Fundamentals: bias-variance, regularization, backprop, attention, evaluation metrics | This module + flashcards + teach-it-back method
Coding (DSA) | Arrays, graphs, trees, DP — same as SWE roles | LeetCode medium (150 questions minimum)
ML System Design | Design recommendation, search, fraud detection at scale | Module 9 + Chip Huyen's "Designing ML Systems" book
Behavioral | Leadership, project failures, cross-team collaboration | STAR stories from your real experience

Topic 02

Top 15 ML Theory Questions & Answers

Q1: What's the difference between supervised, unsupervised, and self-supervised learning?
Supervised: Labeled data (X, y) — classification, regression. Unsupervised: No labels — clustering, dimensionality reduction, anomaly detection. Self-supervised: Labels generated from the data itself — BERT predicts masked words, GPT predicts next token. Self-supervised is how LLMs are pretrained on unlabeled text.
Q2: Explain the bias-variance tradeoff
Total error = Bias² + Variance + Irreducible noise. Bias = systematic error (underfitting: model too simple). Variance = sensitivity to training data fluctuations (overfitting: model too complex). Can't reduce both simultaneously — increasing model complexity reduces bias but increases variance. Regularization, ensembles, and cross-validation help find the sweet spot.
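To make the tradeoff concrete, here is a minimal numpy sketch on synthetic data (the sine target, noise level, and polynomial degrees are illustrative choices): an underfit line leaves a large training error, while a high-degree polynomial drives training error down by chasing the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)  # noisy ground truth

def train_mse(degree):
    # Fit a degree-d polynomial and measure error on the training points.
    coefs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coefs, x) - y) ** 2)

underfit_err = train_mse(1)    # high bias: a line cannot capture the sine
overfit_err = train_mse(15)    # high variance: low train error, fits noise
```

On held-out points the ordering typically reverses, which is exactly the tradeoff: the flexible model wins on training data and loses on unseen data.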
Q3: When would you use L1 vs L2 regularization?
L2 (Ridge): Penalizes large weights (w²). Shrinks all weights toward zero but never to exactly zero. Use when all features are potentially useful. L1 (Lasso): Penalizes absolute weights (|w|). Produces sparse solutions — drives irrelevant feature weights to exactly zero. Use for feature selection when you have many features and suspect many are irrelevant. Elastic Net: Combination of both.
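For a single coefficient under an orthonormal design, both penalties have textbook closed forms (the function names here are mine), which makes the sparsity difference easy to see:

```python
import math

def ridge_shrink(w_ols, lam):
    # L2: divide by (1 + lambda); shrinks toward zero, never lands exactly on it
    return w_ols / (1 + lam)

def lasso_shrink(w_ols, lam):
    # L1: soft-thresholding; subtract lambda from the magnitude, clip at zero
    return math.copysign(max(abs(w_ols) - lam, 0.0), w_ols)
```

With lambda = 0.5, a small coefficient of 0.3 becomes exactly 0 under L1 (feature dropped) but only 0.2 under L2; a large coefficient of 2.0 survives both, shrunk.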
Q4: What is cross-validation and why do we use it?
Cross-validation estimates model performance on unseen data without wasting test data. In k-fold CV: split data into k folds, train on k-1 folds, evaluate on 1 fold, rotate, average results. Use when: dataset is small (can't afford a large hold-out set), comparing multiple models. Never use CV score to evaluate final model — use a true held-out test set.
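The fold bookkeeping is worth being able to write from scratch; a minimal numpy sketch (the helper name is mine, not a standard API):

```python
import numpy as np

def kfold_indices(n, k, seed=42):
    """Shuffle indices 0..n-1 and return (train_idx, val_idx) pairs, one per fold."""
    idx = np.arange(n)
    np.random.default_rng(seed).shuffle(idx)
    folds = np.array_split(idx, k)
    return [
        (np.concatenate([folds[j] for j in range(k) if j != i]), folds[i])
        for i in range(k)
    ]

splits = kfold_indices(10, 5)  # 5 folds of 2 validation samples each
```

Each index lands in exactly one validation fold, so every sample is used for evaluation once and for training k-1 times.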
Q5: How does gradient descent work?
Gradient descent minimizes a loss function by iteratively moving in the direction of the negative gradient: w = w - α × ∇L(w). The gradient tells us the direction of steepest increase; we go the opposite direction. Learning rate α controls step size. Variants: Batch GD (entire dataset per step), SGD (one sample per step), Mini-batch GD (typical in deep learning — balance of noise and speed).
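A bare-bones batch gradient descent for linear least squares in numpy (learning rate, step count, and the noiseless synthetic data are illustrative):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad = (2 / n) * X.T @ (X @ w - y)  # gradient of mean squared error
        w -= lr * grad                      # step against the gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w                 # noiseless targets, so GD can recover true_w
w_hat = gradient_descent(X, y)
```

Swapping the full-batch gradient for one sample (SGD) or a small batch changes only the slice of X and y used per step.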
Q6: What is the vanishing gradient problem?
In deep networks, gradients are backpropagated by multiplying derivatives at each layer (chain rule). If these derivatives are <1 (sigmoid saturates), multiplying 50 values <1 gives a near-zero gradient — early layers learn nothing. Solutions: ReLU (derivative = 1 for positive inputs), skip connections (ResNet), batch normalization, LSTM gates.
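The arithmetic is easy to demo: even in the sigmoid's best case (derivative 0.25, at input 0), fifty chained layers multiply down to essentially nothing, while ReLU's unit derivative passes the gradient through untouched.

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

depth = 50
sigmoid_chain = sigmoid_deriv(0.0) ** depth  # 0.25**50 = 2**-100, about 8e-31
relu_chain = 1.0 ** depth                    # ReLU derivative is 1 for x > 0
```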
Q7: Explain the attention mechanism in 60 seconds
Attention lets each position in a sequence directly attend to every other position. For each token (Query), we compute similarity with all other tokens (Keys), softmax to get weights, and compute weighted sum of Values. Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. This gives any token direct access to any other token — unlike RNNs where information must travel through hidden states sequentially.
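The formula above fits in a dozen lines of numpy (single head, no masking; the shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    weights = softmax(scores)        # one probability distribution per query
    return weights @ V, weights      # output: weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1: token i's output is a convex combination of all value vectors, which is the "direct access" property.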
Q8: What is overfitting and how do you prevent it?
Overfitting: model memorizes training data, performs poorly on test data. High train accuracy, low test accuracy = overfitting. Prevention: Regularization (L1/L2), Dropout (randomly zero out neurons during training), Early stopping (monitor validation loss), Data augmentation, Cross-validation, Reduce model complexity, Get more data.
Q9: When would you use precision vs recall vs F1?
Precision = TP/(TP+FP): when false positives are costly (spam filter — don't mark legitimate email as spam). Recall = TP/(TP+FN): when false negatives are costly (cancer detection — don't miss a real cancer). F1 = harmonic mean of precision and recall: when classes are imbalanced and you need to balance both. AUC-ROC: threshold-independent, overall ranking quality.
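A worked example with made-up confusion-matrix counts for a rare-positive problem, showing why accuracy alone misleads:

```python
TP, FP, FN, TN = 80, 10, 20, 890   # hypothetical counts, positives are rare

precision = TP / (TP + FP)                          # 80/90, about 0.889
recall = TP / (TP + FN)                             # 80/100 = 0.800
f1 = 2 * precision * recall / (precision + recall)  # about 0.842
accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.970: looks great,
                                                    # says little about positives
```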
Q10: What is data leakage and how do you detect it?
Data leakage: information from the future or test set accidentally enters training. Results in suspiciously high train/validation accuracy that collapses in production. Examples: using "days_until_churn" as a feature to predict churn (only know this after churn happens), normalizing before train/test split (test mean leaks into training). Detection: if model performs perfectly on validation but fails in production, suspect leakage.
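A small sketch of the normalize-before-split example (synthetic data; the point is purely the order of operations):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)
train, test = data[:150], data[150:]

leaky_mean = data.mean()   # WRONG: computed on all 200 points, so test-set
                           # statistics leak into preprocessing

mu, sigma = train.mean(), train.std()  # RIGHT: fit the scaler on train only...
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma      # ...then apply the same transform to test
```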
Q11: Explain bagging vs boosting
Bagging (Bootstrap Aggregating): train many models independently on random subsets of data, average predictions. Reduces variance. Example: Random Forest. Boosting: train models sequentially, each focusing on previous model's errors. Reduces bias. Examples: XGBoost, AdaBoost, LightGBM. Boosting usually achieves better accuracy; bagging is more robust to noise.
Q12: How does BERT differ from GPT?
BERT: Encoder-only, bidirectional (sees full context), trained with Masked Language Modeling (predict masked tokens). Best for understanding tasks: classification, NER, Q&A. GPT: Decoder-only, autoregressive (left-to-right), trained to predict next token. Best for generation tasks. BERT is better at sentence understanding; GPT is better at text generation.
Q13: How do you handle imbalanced datasets?
1) Resample: oversample minority class (SMOTE), undersample majority class. 2) Class weights: increase loss weight for minority class in training. 3) Threshold tuning: lower decision threshold for minority class. 4) Evaluation: never use accuracy — use F1, AUC-ROC, or precision-recall curve. 5) Collect more minority class data if possible. Choice depends on dataset size and how imbalanced it is.
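Option 2 is often a one-liner. The "balanced" heuristic (the convention behind scikit-learn's class_weight='balanced') weights each class inversely to its frequency; a dependency-free sketch with a helper name of my own:

```python
from collections import Counter

def balanced_class_weights(labels):
    """w_c = n_samples / (n_classes * count_c): rare classes get large weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

weights = balanced_class_weights([0] * 90 + [1] * 10)  # 90/10 imbalance
```

The minority class's loss gets weight 5.0 versus about 0.56 for the majority, so each class contributes equally to the total loss.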
Q14: What is the difference between generative and discriminative models?
Discriminative: Models P(y|x) directly — learns the decision boundary between classes. Examples: logistic regression, SVM, neural networks. Generative: Models the joint distribution P(x, y) or P(x) — learns how data is generated. Examples: Naive Bayes, GANs, VAEs, diffusion models. Discriminative models are usually more accurate for classification; generative models can generate new data and handle missing features better.
Q15: What is LoRA and when would you use it?
LoRA (Low-Rank Adaptation) freezes pretrained model weights and injects trainable low-rank decomposition matrices (ΔW = A×B, rank r≪d) into attention layers. Trains 0.1-1% of parameters, reducing GPU memory 10-50×. Use when: fine-tuning LLMs for specific tasks/styles, limited GPU budget, need multiple task-specific adapters (swap without reloading base model). Don't use for domain knowledge injection — use RAG instead.
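The parameter saving is just arithmetic; assuming a 7B-class attention weight matrix (d = 4096, an assumption for illustration) and rank r = 8:

```python
d, r = 4096, 8                        # hidden size (assumed) and LoRA rank

full_params = d * d                   # 16,777,216 weights to fine-tune directly
lora_params = d * r + r * d           # A is d x r, B is r x d: 65,536 weights
fraction = lora_params / full_params  # 1/256, roughly 0.39% of the frozen matrix
```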

Topic 03

Python & SQL Patterns for ML Roles

Python patterns you must know
  • List comprehension: [f(x) for x in data if condition(x)]
  • Dictionary comprehension: {k: v for k, v in pairs}
  • Pandas groupby: df.groupby('category')['value'].agg(['mean', 'count'])
  • NumPy vectorization: Replace Python loops with np.where(), np.sum(axis=1)
  • Scikit-learn pipeline: Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
  • Train-test split: Always use stratify=y for classification tasks
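stratify=y matters because a plain random split can under-represent a rare class in the test set. A dependency-free sketch of what stratification does under the hood (function name is mine; it takes only the labels):

```python
import random
from collections import defaultdict

def stratified_split(y, test_frac=0.2, seed=0):
    """Hold out test_frac of *each* class so label proportions are preserved."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)           # group sample indices by label
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx += idxs[:n_test]           # take test_frac from this class
        train_idx += idxs[n_test:]
    return train_idx, test_idx

y = [0] * 80 + [1] * 20                     # 80/20 class imbalance
train_idx, test_idx = stratified_split(y)
```

The 20-sample test set contains exactly 4 positives, matching the 20% base rate.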
SQL patterns for data science interviews
  • Window functions: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC)
  • Running totals: SUM(amount) OVER (PARTITION BY user ORDER BY date)
  • Self-join for cohort analysis: Join users table to events on user_id
  • CASE WHEN: CASE WHEN score > 0.5 THEN 'positive' ELSE 'negative' END
  • Retention query: Find users active in week 1 who were also active in week 2

Topic 04

Behavioral Interviews & the STAR Framework

STAR = Situation, Task, Action, Result. Every behavioral answer follows this structure. Prepare 8-10 STAR stories from your experience. Each story can be adapted to multiple questions.

Common question | Story angle to use
Tell me about a time you dealt with a model failure in production | Technical crisis → root cause analysis → systematic fix → monitoring improvement
How did you influence a team without authority? | Data quality issue → built dashboard to show impact → stakeholders aligned → fixed pipeline
Describe a time you had to make a decision with incomplete data | Launched model with limited data → A/B test → iterated based on results
Tell me about your biggest technical failure | Data leakage bug → inflated metrics → caught in production → prevention system built
How did you prioritize competing projects? | Impact × effort matrix → stakeholder alignment → delivered highest value first

Topic 05

Mock Interview Strategy

  • Think out loud: Interviewers can't read your mind. Narrate your reasoning, even when stuck
  • "I don't know" is fine: "I'm not certain, but my intuition is... Let me reason through it..." shows better thinking than a wrong confident answer
  • Ask clarifying questions: "Are we optimizing for latency or throughput?" "What scale are we talking about?" This shows senior thinking
  • Start simple: Always propose the simplest solution first, then discuss improvements. Interviewers want to see your thought process, not just the final answer
  • Check your assumptions: "I'm assuming [X] — is that correct?" prevents going down the wrong path for 20 minutes
Common mistakes
  • Jumping to complex models before establishing a baseline
  • Ignoring evaluation metrics and class imbalance
  • Not discussing tradeoffs (accuracy vs. latency, precision vs. recall)
  • Forgetting to mention monitoring and retraining in system design
  • Giving textbook answers without connecting to real experience

Topic 06

30-Day Interview Prep Plan

Week | Focus | Daily activities
Week 1 | ML Fundamentals | Study bias-variance, regularization, evaluation metrics. Implement linear/logistic regression from scratch. 2 LeetCode problems/day
Week 2 | Deep Learning & NLP | Study neural networks, attention, transformers. Fine-tune a Hugging Face model. 2 LeetCode problems/day
Week 3 | System Design | Study recommendation systems, search ranking. Design 2 ML systems per day. Read Chip Huyen's blog posts
Week 4 | Mock Interviews | Daily mock interviews (ML theory + system design). Record yourself. Review and iterate. Practice behavioral stories
Final checklist before interviews
  • Can you explain gradient descent, backprop, and attention in 2 minutes each? (whiteboard practice)
  • Have you memorized the confusion matrix, and can you derive precision, recall, and F1 from it?
  • Can you design a recommendation system architecture from scratch in 45 minutes?
  • Do you have 8 STAR stories ready, each under 3 minutes?
  • Have you done at least 5 full mock interviews with a friend or on interviewing.io?
  • Do you have questions ready to ask the interviewer? (Shows genuine interest)