Contents
Topic 01
The 4 Interview Types for AI/ML Roles
| Type | What they test | How to prepare |
|---|---|---|
| ML Theory | Fundamentals: bias-variance, regularization, backprop, attention, evaluation metrics | This module + flashcards + teach-it-back method |
| Coding (DSA) | Arrays, graphs, trees, DP — same as SWE roles | LeetCode medium (150 questions minimum) |
| ML System Design | Design recommendation, search, fraud detection at scale | Module 9 + Chip Huyen's "Designing ML Systems" book |
| Behavioral | Leadership, project failures, cross-team collaboration | STAR stories from your real experience |
Topic 02
Top 25 ML Theory Questions & Answers
Q1: What's the difference between supervised, unsupervised, and self-supervised learning?
Supervised: Labeled data (X, y) — classification, regression. Unsupervised: No labels — clustering, dimensionality reduction, anomaly detection. Self-supervised: Labels generated from the data itself — BERT predicts masked words, GPT predicts next token. Self-supervised is how LLMs are pretrained on unlabeled text.
Q2: Explain the bias-variance tradeoff
Total error = Bias² + Variance + Irreducible noise. Bias = systematic error (underfitting: model too simple). Variance = sensitivity to training data fluctuations (overfitting: model too complex). Can't reduce both simultaneously — increasing model complexity reduces bias but increases variance. Regularization, ensembles, and cross-validation help find the sweet spot.
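A quick way to see the tradeoff in code (an illustrative sketch with made-up data, using NumPy's `polyfit`): a degree-1 polynomial underfits a wiggly target, while a degree-15 polynomial chases noise. Training error always drops as capacity grows; test error is what reveals the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)          # noisy nonlinear target
x_test = rng.uniform(-1, 1, 30)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 30)

def errors(degree):
    coefs = np.polyfit(x, y, degree)
    train = np.mean((np.polyval(coefs, x) - y) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

train_lo, test_lo = errors(1)    # high bias: too simple, underfits
train_hi, test_hi = errors(15)   # high variance: fits the noise

# More capacity always lowers *training* error; test error is the real judge.
print(f"deg 1:  train={train_lo:.3f} test={test_lo:.3f}")
print(f"deg 15: train={train_hi:.3f} test={test_hi:.3f}")
```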
Q3: When would you use L1 vs L2 regularization?
L2 (Ridge): Penalizes large weights (w²). Shrinks all weights toward zero but never to exactly zero. Use when all features are potentially useful. L1 (Lasso): Penalizes absolute weights (|w|). Produces sparse solutions — drives irrelevant feature weights to exactly zero. Use for feature selection when you have many features and suspect many are irrelevant. Elastic Net: Combination of both.
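To see the sparsity difference concretely, here is a minimal sketch (synthetic data, scikit-learn assumed available): only 2 of 10 features matter, and L1 drives the irrelevant weights to exactly zero while L2 merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("L1 zeroed out:", int(np.sum(lasso.coef_ == 0)), "of 10 weights")
print("L2 zeroed out:", int(np.sum(ridge.coef_ == 0)), "of 10 weights")
```

The L1 soft-thresholding kills every weight whose signal is below the penalty, which is exactly why Lasso doubles as feature selection.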
Q4: What is cross-validation and why do we use it?
Cross-validation estimates model performance on unseen data without wasting test data. In k-fold CV: split data into k folds, train on k-1 folds, evaluate on 1 fold, rotate, average results. Use when: dataset is small (can't afford a large hold-out set), comparing multiple models. Never use CV score to evaluate final model — use a true held-out test set.
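The k-fold procedure above is one call in scikit-learn (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 5-fold CV: train on 4 folds, score on the 5th, rotate, then average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```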
Q5: How does gradient descent work?
Gradient descent minimizes a loss function by iteratively moving in the direction of the negative gradient: w = w - α × ∇L(w). The gradient tells us the direction of steepest increase; we go the opposite direction. Learning rate α controls step size. Variants: Batch GD (entire dataset per step), SGD (one sample per step), Mini-batch GD (typical in deep learning — balance of noise and speed).
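The update rule fits in a few lines. A minimal from-scratch sketch, fitting y = 2x + 1 with batch gradient descent on MSE loss:

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = 2 * x + 1                              # noiseless target: w=2, b=1

w, b, lr = 0.0, 0.0, 0.1                   # lr is the learning rate α
for _ in range(500):
    pred = w * x + b
    grad_w = np.mean(2 * (pred - y) * x)   # ∂L/∂w for MSE
    grad_b = np.mean(2 * (pred - y))       # ∂L/∂b
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))            # → 2.0 1.0
```

Swapping the full-batch `np.mean` for a single random sample gives SGD; averaging over a small random subset gives mini-batch GD.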
Q6: What is the vanishing gradient problem?
In deep networks, gradients are backpropagated by multiplying derivatives at each layer (chain rule). If these derivatives are <1 (sigmoid saturates), multiplying 50 values <1 gives a near-zero gradient — early layers learn nothing. Solutions: ReLU (derivative = 1 for positive inputs), skip connections (ResNet), batch normalization, LSTM gates.
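The arithmetic is easy to demo directly: the sigmoid's derivative peaks at 0.25, so even in the best case a 50-layer chain of sigmoid derivatives multiplies out to essentially zero.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)                 # maximum value 0.25, at x = 0

# Chain rule through 50 sigmoid layers, even at the best-case input x = 0:
vanished = sigmoid_grad(0.0) ** 50
print(vanished)                        # 0.25**50 ≈ 8e-31 — early layers get no signal

# ReLU's derivative is exactly 1 for positive inputs, so the product survives:
print(1.0 ** 50)                       # → 1.0
```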
Q7: Explain the attention mechanism in 60 seconds
Attention lets each position in a sequence directly attend to every other position. For each token (Query), we compute similarity with all other tokens (Keys), softmax to get weights, and compute weighted sum of Values. Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. This gives any token direct access to any other token — unlike RNNs where information must travel through hidden states sequentially.
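The formula maps line-for-line to NumPy. A minimal scaled dot-product attention sketch (random toy matrices; single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # QKᵀ/√d_k
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V, weights                           # weighted sum of Values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))     # 4 tokens, dim 8
out, w = attention(Q, K, V)
print(out.shape)                # (4, 8) — one output vector per token
print(w.sum(axis=-1))           # each token's weights sum to 1
```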
Q8: What is overfitting and how do you prevent it?
Overfitting: model memorizes training data, performs poorly on test data. High train accuracy, low test accuracy = overfitting. Prevention: Regularization (L1/L2), Dropout (randomly zero out neurons during training), Early stopping (monitor validation loss), Data augmentation, Cross-validation, Reduce model complexity, Get more data.
Q9: When would you use precision vs recall vs F1?
Precision = TP/(TP+FP): when false positives are costly (spam filter — don't mark legitimate email as spam). Recall = TP/(TP+FN): when false negatives are costly (cancer detection — don't miss a real cancer). F1 = harmonic mean of precision and recall: when classes are imbalanced and you need to balance both. AUC-ROC: threshold-independent, overall ranking quality.
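Worth being able to derive these on a whiteboard. With illustrative counts (8 true positives, 2 false positives, 4 false negatives):

```python
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)    # of everything flagged, how much was real
recall = TP / (TP + FN)       # of everything real, how much was caught
f1 = 2 * precision * recall / (precision + recall)   # = 2TP/(2TP+FP+FN)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.8 0.667 0.727
```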
Q10: What is data leakage and how do you detect it?
Data leakage: information from the future or test set accidentally enters training. Results in suspiciously high train/validation accuracy that collapses in production. Examples: using "days_until_churn" as a feature to predict churn (only know this after churn happens), normalizing before train/test split (test mean leaks into training). Detection: if model performs perfectly on validation but fails in production, suspect leakage.
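The normalize-before-split leak has a standard fix: put the scaler inside a scikit-learn `Pipeline`, so it is refit on each training fold only. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Leaky version (don't do this): StandardScaler().fit_transform(X) before CV
# lets each test fold's mean/std leak into the training statistics.

# Leak-free: the scaler is fit inside every CV training fold.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leak-free 5-fold accuracy: {scores.mean():.3f}")
```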
Q11: Explain bagging vs boosting
Bagging (Bootstrap Aggregating): train many models independently on random subsets of data, average predictions. Reduces variance. Example: Random Forest. Boosting: train models sequentially, each focusing on previous model's errors. Reduces bias. Examples: XGBoost, AdaBoost, LightGBM. Boosting usually achieves better accuracy; bagging is more robust to noise.
Q12: How does BERT differ from GPT?
BERT: Encoder-only, bidirectional (sees full context), trained with Masked Language Modeling (predict masked tokens). Best for understanding tasks: classification, NER, Q&A. GPT: Decoder-only, autoregressive (left-to-right), trained to predict next token. Best for generation tasks. BERT is better at sentence understanding; GPT is better at text generation.
Q13: How do you handle imbalanced datasets?
1) Resample: oversample minority class (SMOTE), undersample majority class. 2) Class weights: increase loss weight for minority class in training. 3) Threshold tuning: lower decision threshold for minority class. 4) Evaluation: never use accuracy — use F1, AUC-ROC, or precision-recall curve. 5) Collect more minority class data if possible. Choice depends on dataset size and how imbalanced it is.
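For option 2, scikit-learn's "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). A quick sketch with a 90:10 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)          # 90:10 imbalance
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)   # minority class gets 9× the loss weight of the majority
```

Passing `class_weight='balanced'` to estimators like `LogisticRegression` applies these weights automatically during training.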
Q14: What is the difference between generative and discriminative models?
Discriminative: Models P(y|x) directly — learns the decision boundary between classes. Examples: logistic regression, SVM, neural networks. Generative: Models the joint distribution P(x, y) or P(x) — learns how data is generated. Examples: Naive Bayes, GANs, VAEs, diffusion models. Discriminative models are usually more accurate for classification; generative models can generate new data and handle missing features better.
Q15: What is LoRA and when would you use it?
LoRA (Low-Rank Adaptation) freezes pretrained model weights and injects trainable low-rank decomposition matrices (ΔW = A×B, rank r≪d) into attention layers. Trains 0.1-1% of parameters, reducing GPU memory 10-50×. Use when: fine-tuning LLMs for specific tasks/styles, limited GPU budget, need multiple task-specific adapters (swap without reloading base model). Don't use for domain knowledge injection — use RAG instead.
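The parameter math is easy to check. A minimal NumPy sketch of one LoRA-adapted layer (toy sizes, not a real training loop; zero-initializing one factor makes ΔW = 0 at the start, as in the LoRA paper):

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank, r ≪ d
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight — never updated
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                 # trainable up-projection, zero-init ⇒ ΔW = A@B = 0

def forward(x):
    return x @ W + x @ A @ B         # base path plus low-rank update

trainable = A.size + B.size          # 2rd parameters instead of d²
print(f"trainable fraction: {trainable / W.size:.2%}")   # 2r/d ≈ 1.56%
```

Swapping adapters means swapping just A and B (a few MB) while the base W stays loaded.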
Topic 03
Python & SQL Patterns for ML Roles
Python patterns you must know
- List comprehension: `[f(x) for x in data if condition(x)]`
- Dictionary comprehension: `{k: v for k, v in pairs}`
- Pandas groupby: `df.groupby('category')['value'].agg(['mean', 'count'])`
- NumPy vectorization: replace Python loops with `np.where()`, `np.sum(axis=1)`
- Scikit-learn pipeline: `Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])`
- Train-test split: always use `stratify=y` for classification tasks
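The groupby pattern above, runnable end-to-end on a tiny illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value":    [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per category, with the mean and count of 'value'.
stats = df.groupby("category")["value"].agg(["mean", "count"])
print(stats)   # a: mean 2.0, count 2; b: mean 4.0, count 3
```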
SQL patterns for data science interviews
- Window functions: `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC)`
- Running totals: `SUM(amount) OVER (PARTITION BY user_id ORDER BY date)`
- Self-join for cohort analysis: join the users table to events on `user_id`
- CASE WHEN: `CASE WHEN score > 0.5 THEN 'positive' ELSE 'negative' END`
- Retention query: find users active in week 1 who were also active in week 2
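You can practice these without a database server: Python's built-in `sqlite3` supports window functions (SQLite ≥ 3.25, bundled with Python 3.8+). A minimal running-total sketch on made-up event data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, date TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "2024-01-01", 10.0),
    ("u1", "2024-01-02", 5.0),
    ("u2", "2024-01-01", 7.0),
])

# Per-user running total: the ORDER BY inside OVER makes SUM cumulative.
rows = con.execute("""
    SELECT user_id, date,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS running_total
    FROM events
    ORDER BY user_id, date
""").fetchall()
print(rows)
# → [('u1', '2024-01-01', 10.0), ('u1', '2024-01-02', 15.0), ('u2', '2024-01-01', 7.0)]
```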
Topic 04
Behavioral Interviews & the STAR Framework
STAR = Situation, Task, Action, Result. Every behavioral answer follows this structure. Prepare 8-10 STAR stories from your experience. Each story can be adapted to multiple questions.
| Common question | Story angle to use |
|---|---|
| Tell me about a time you dealt with a model failure in production | Technical crisis → root cause analysis → systematic fix → monitoring improvement |
| How did you influence a team without authority? | Data quality issue → built dashboard to show impact → stakeholders aligned → fixed pipeline |
| Describe a time you had to make a decision with incomplete data | Launched model with limited data → A/B test → iterated based on results |
| Tell me about your biggest technical failure | Data leakage bug → inflated metrics → caught in production → prevention system built |
| How did you prioritize competing projects? | Impact × effort matrix → stakeholder alignment → delivered highest value first |
Topic 05
Mock Interview Strategy
- Think out loud: Interviewers can't read your mind. Narrate your reasoning, even when stuck
- "I don't know" is fine: "I'm not certain, but my intuition is... Let me reason through it..." shows better thinking than a wrong confident answer
- Ask clarifying questions: "Are we optimizing for latency or throughput?" "What scale are we talking about?" This shows senior thinking
- Start simple: Always propose the simplest solution first, then discuss improvements. Interviewers want to see your thought process, not just the final answer
- Check your assumptions: "I'm assuming [X] — is that correct?" prevents going down the wrong path for 20 minutes
Common mistakes
- Jumping to complex models before establishing a baseline
- Ignoring evaluation metrics and class imbalance
- Not discussing tradeoffs (accuracy vs. latency, precision vs. recall)
- Forgetting to mention monitoring and retraining in system design
- Giving textbook answers without connecting to real experience
Topic 06
30-Day Interview Prep Plan
| Week | Focus | Daily activities |
|---|---|---|
| Week 1 | ML Fundamentals | Study bias-variance, regularization, evaluation metrics. Implement linear/logistic regression from scratch. 2 LeetCode problems/day |
| Week 2 | Deep Learning & NLP | Study neural networks, attention, transformers. Fine-tune a Hugging Face model. 2 LeetCode problems/day |
| Week 3 | System Design | Study recommendation systems, search ranking. Design 2 ML systems per day. Read Chip Huyen's blog posts |
| Week 4 | Mock Interviews | Daily mock interviews (ML theory + system design). Record yourself. Review and iterate. Practice behavioral stories |
Final checklist before interviews
- Can you explain gradient descent, backprop, and attention in 2 minutes each? (whiteboard practice)
- Have you memorized the confusion matrix and derived precision, recall, F1 from it?
- Can you design a recommendation system architecture from scratch in 45 minutes?
- Do you have 8 STAR stories ready, each under 3 minutes?
- Have you done at least 5 full mock interviews with a friend or on interviewing.io?
- Do you have questions ready to ask the interviewer? (Shows genuine interest)