Contents
Topic 01
The 4 Interview Types for AI/ML Roles
| Type | What they test | How to prepare |
|---|---|---|
| ML Theory | Fundamentals: bias-variance, regularization, backprop, attention, evaluation metrics | This module + flashcards + teach-it-back method |
| Coding (DSA) | Arrays, graphs, trees, DP — same as SWE roles | LeetCode medium (150 questions minimum) |
| ML System Design | Design recommendation, search, fraud detection at scale | Module 9 + Chip Huyen's "Designing ML Systems" book |
| Behavioral | Leadership, project failures, cross-team collaboration | STAR stories from your real experience |
Topic 02
Top 25 ML Theory Questions & Answers
Q1: What's the difference between supervised, unsupervised, and self-supervised learning?
Supervised: Labeled data (X, y) — classification, regression. Unsupervised: No labels — clustering, dimensionality reduction, anomaly detection. Self-supervised: Labels generated from the data itself — BERT predicts masked words, GPT predicts next token. Self-supervised is how LLMs are pretrained on unlabeled text.
Q2: Explain the bias-variance tradeoff
Total error = Bias² + Variance + Irreducible noise. Bias = systematic error (underfitting: model too simple). Variance = sensitivity to training data fluctuations (overfitting: model too complex). Can't reduce both simultaneously — increasing model complexity reduces bias but increases variance. Regularization, ensembles, and cross-validation help find the sweet spot.
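A quick way to see the tradeoff in code (an illustrative sketch with made-up data, using NumPy's `polyfit`): a degree-1 polynomial underfits a wiggly target, while a degree-15 polynomial chases noise. Training error always drops as capacity grows; test error is what reveals the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)          # noisy nonlinear target
x_test = rng.uniform(-1, 1, 30)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 30)

def errors(degree):
    coefs = np.polyfit(x, y, degree)
    train = np.mean((np.polyval(coefs, x) - y) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

train_lo, test_lo = errors(1)    # high bias: too simple, underfits
train_hi, test_hi = errors(15)   # high variance: fits the noise

# More capacity always lowers *training* error; test error is the real judge.
print(f"deg 1:  train={train_lo:.3f} test={test_lo:.3f}")
print(f"deg 15: train={train_hi:.3f} test={test_hi:.3f}")
```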
Q3: When would you use L1 vs L2 regularization?
L2 (Ridge): Penalizes large weights (w²). Shrinks all weights toward zero but never to exactly zero. Use when all features are potentially useful. L1 (Lasso): Penalizes absolute weights (|w|). Produces sparse solutions — drives irrelevant feature weights to exactly zero. Use for feature selection when you have many features and suspect many are irrelevant. Elastic Net: Combination of both.
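To see the sparsity difference concretely, here is a minimal sketch (synthetic data, scikit-learn assumed available): only 2 of 10 features matter, and L1 drives the irrelevant weights to exactly zero while L2 merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("L1 zeroed out:", int(np.sum(lasso.coef_ == 0)), "of 10 weights")
print("L2 zeroed out:", int(np.sum(ridge.coef_ == 0)), "of 10 weights")
```

The L1 soft-thresholding kills every weight whose signal is below the penalty, which is exactly why Lasso doubles as feature selection.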
Q4: What is cross-validation and why do we use it?
Cross-validation estimates model performance on unseen data without wasting test data. In k-fold CV: split data into k folds, train on k-1 folds, evaluate on 1 fold, rotate, average results. Use when: dataset is small (can't afford a large hold-out set), comparing multiple models. Never use CV score to evaluate final model — use a true held-out test set.
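The k-fold procedure above is one call in scikit-learn (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 5-fold CV: train on 4 folds, score on the 5th, rotate, then average.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```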
Q5: How does gradient descent work?
Gradient descent minimizes a loss function by iteratively moving in the direction of the negative gradient: w = w - α × ∇L(w). The gradient tells us the direction of steepest increase; we go the opposite direction. Learning rate α controls step size. Variants: Batch GD (entire dataset per step), SGD (one sample per step), Mini-batch GD (typical in deep learning — balance of noise and speed).
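The update rule fits in a few lines. A minimal from-scratch sketch, fitting y = 2x + 1 with batch gradient descent on MSE loss:

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = 2 * x + 1                              # noiseless target: w=2, b=1

w, b, lr = 0.0, 0.0, 0.1                   # lr is the learning rate α
for _ in range(500):
    pred = w * x + b
    grad_w = np.mean(2 * (pred - y) * x)   # ∂L/∂w for MSE
    grad_b = np.mean(2 * (pred - y))       # ∂L/∂b
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))            # → 2.0 1.0
```

Swapping the full-batch `np.mean` for a single random sample gives SGD; averaging over a small random subset gives mini-batch GD.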
Q6: What is the vanishing gradient problem?
In deep networks, gradients are backpropagated by multiplying derivatives at each layer (chain rule). If these derivatives are <1 (sigmoid saturates), multiplying 50 values <1 gives a near-zero gradient — early layers learn nothing. Solutions: ReLU (derivative = 1 for positive inputs), skip connections (ResNet), batch normalization, LSTM gates.
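The arithmetic is easy to demo directly: the sigmoid's derivative peaks at 0.25, so even in the best case a 50-layer chain of sigmoid derivatives multiplies out to essentially zero.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)                 # maximum value 0.25, at x = 0

# Chain rule through 50 sigmoid layers, even at the best-case input x = 0:
vanished = sigmoid_grad(0.0) ** 50
print(vanished)                        # 0.25**50 ≈ 8e-31 — early layers get no signal

# ReLU's derivative is exactly 1 for positive inputs, so the product survives:
print(1.0 ** 50)                       # → 1.0
```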
Q7: Explain the attention mechanism in 60 seconds
Attention lets each position in a sequence directly attend to every other position. For each token (Query), we compute similarity with all other tokens (Keys), softmax to get weights, and compute weighted sum of Values. Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. This gives any token direct access to any other token — unlike RNNs where information must travel through hidden states sequentially.
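The formula maps line-for-line to NumPy. A minimal scaled dot-product attention sketch (random toy matrices; single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # QKᵀ/√d_k
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V, weights                           # weighted sum of Values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))     # 4 tokens, dim 8
out, w = attention(Q, K, V)
print(out.shape)                # (4, 8) — one output vector per token
print(w.sum(axis=-1))           # each token's weights sum to 1
```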
Q8: What is overfitting and how do you prevent it?
Overfitting: model memorizes training data, performs poorly on test data. High train accuracy, low test accuracy = overfitting. Prevention: Regularization (L1/L2), Dropout (randomly zero out neurons during training), Early stopping (monitor validation loss), Data augmentation, Cross-validation, Reduce model complexity, Get more data.
Q9: When would you use precision vs recall vs F1?
Precision = TP/(TP+FP): when false positives are costly (spam filter — don't mark legitimate email as spam). Recall = TP/(TP+FN): when false negatives are costly (cancer detection — don't miss a real cancer). F1 = harmonic mean of precision and recall: when classes are imbalanced and you need to balance both. AUC-ROC: threshold-independent, overall ranking quality.
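Worth being able to derive these on a whiteboard. With illustrative counts (8 true positives, 2 false positives, 4 false negatives):

```python
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)    # of everything flagged, how much was real
recall = TP / (TP + FN)       # of everything real, how much was caught
f1 = 2 * precision * recall / (precision + recall)   # = 2TP/(2TP+FP+FN)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.8 0.667 0.727
```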
Q10: What is data leakage and how do you detect it?
Data leakage: information from the future or test set accidentally enters training. Results in suspiciously high train/validation accuracy that collapses in production. Examples: using "days_until_churn" as a feature to predict churn (only know this after churn happens), normalizing before train/test split (test mean leaks into training). Detection: if model performs perfectly on validation but fails in production, suspect leakage.
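The normalize-before-split leak has a standard fix: put the scaler inside a scikit-learn `Pipeline`, so it is refit on each training fold only. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Leaky version (don't do this): StandardScaler().fit_transform(X) before CV
# lets each test fold's mean/std leak into the training statistics.

# Leak-free: the scaler is fit inside every CV training fold.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leak-free 5-fold accuracy: {scores.mean():.3f}")
```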
Q11: Explain bagging vs boosting
Bagging (Bootstrap Aggregating): train many models independently on random subsets of data, average predictions. Reduces variance. Example: Random Forest. Boosting: train models sequentially, each focusing on previous model's errors. Reduces bias. Examples: XGBoost, AdaBoost, LightGBM. Boosting usually achieves better accuracy; bagging is more robust to noise.
Q12: How does BERT differ from GPT?
BERT: Encoder-only, bidirectional (sees full context), trained with Masked Language Modeling (predict masked tokens). Best for understanding tasks: classification, NER, Q&A. GPT: Decoder-only, autoregressive (left-to-right), trained to predict next token. Best for generation tasks. BERT is better at sentence understanding; GPT is better at text generation.
Q13: How do you handle imbalanced datasets?
1) Resample: oversample minority class (SMOTE), undersample majority class. 2) Class weights: increase loss weight for minority class in training. 3) Threshold tuning: lower decision threshold for minority class. 4) Evaluation: never use accuracy — use F1, AUC-ROC, or precision-recall curve. 5) Collect more minority class data if possible. Choice depends on dataset size and how imbalanced it is.
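For option 2, scikit-learn's "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). A quick sketch with a 90:10 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)          # 90:10 imbalance
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)   # minority class gets 9× the loss weight of the majority
```

Passing `class_weight='balanced'` to estimators like `LogisticRegression` applies these weights automatically during training.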
Q14: What is the difference between generative and discriminative models?
Discriminative: Models P(y|x) directly — learns the decision boundary between classes. Examples: logistic regression, SVM, neural networks. Generative: Models the joint distribution P(x, y) or P(x) — learns how data is generated. Examples: Naive Bayes, GANs, VAEs, diffusion models. Discriminative models are usually more accurate for classification; generative models can generate new data and handle missing features better.
Q15: What is LoRA and when would you use it?
LoRA (Low-Rank Adaptation) freezes pretrained model weights and injects trainable low-rank decomposition matrices (ΔW = A×B, rank r≪d) into attention layers. Trains 0.1-1% of parameters, reducing GPU memory 10-50×. Use when: fine-tuning LLMs for specific tasks/styles, limited GPU budget, need multiple task-specific adapters (swap without reloading base model). Don't use for domain knowledge injection — use RAG instead.
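The parameter math is easy to check. A minimal NumPy sketch of one LoRA-adapted layer (toy sizes, not a real training loop; zero-initializing one factor makes ΔW = 0 at the start, as in the LoRA paper):

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank, r ≪ d
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight — never updated
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                 # trainable up-projection, zero-init ⇒ ΔW = A@B = 0

def forward(x):
    return x @ W + x @ A @ B         # base path plus low-rank update

trainable = A.size + B.size          # 2rd parameters instead of d²
print(f"trainable fraction: {trainable / W.size:.2%}")   # 2r/d ≈ 1.56%
```

Swapping adapters means swapping just A and B (a few MB) while the base W stays loaded.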
Topic 03
Python & SQL Patterns for ML Roles
Python patterns you must know
- List comprehension: `[f(x) for x in data if condition(x)]`
- Dictionary comprehension: `{k: v for k, v in pairs}`
- Pandas groupby: `df.groupby('category')['value'].agg(['mean', 'count'])`
- NumPy vectorization: replace Python loops with `np.where()`, `np.sum(axis=1)`
- Scikit-learn pipeline: `Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])`
- Train-test split: always use `stratify=y` for classification tasks
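The groupby pattern above, runnable end-to-end on a tiny illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value":    [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per category, with the mean and count of 'value'.
stats = df.groupby("category")["value"].agg(["mean", "count"])
print(stats)   # a: mean 2.0, count 2; b: mean 4.0, count 3
```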
SQL patterns for data science interviews
- Window functions: `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC)`
- Running totals: `SUM(amount) OVER (PARTITION BY user_id ORDER BY date)`
- Self-join for cohort analysis: join the users table to events on `user_id`
- CASE WHEN: `CASE WHEN score > 0.5 THEN 'positive' ELSE 'negative' END`
- Retention query: find users active in week 1 who were also active in week 2
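You can practice these without a database server: Python's built-in `sqlite3` supports window functions (SQLite ≥ 3.25, bundled with Python 3.8+). A minimal running-total sketch on made-up event data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, date TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "2024-01-01", 10.0),
    ("u1", "2024-01-02", 5.0),
    ("u2", "2024-01-01", 7.0),
])

# Per-user running total: the ORDER BY inside OVER makes SUM cumulative.
rows = con.execute("""
    SELECT user_id, date,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS running_total
    FROM events
    ORDER BY user_id, date
""").fetchall()
print(rows)
# → [('u1', '2024-01-01', 10.0), ('u1', '2024-01-02', 15.0), ('u2', '2024-01-01', 7.0)]
```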
Topic 04
Behavioral Interviews & the STAR Framework
STAR = Situation, Task, Action, Result. Every behavioral answer follows this structure. Prepare 8-10 STAR stories from your experience. Each story can be adapted to multiple questions.
| Common question | Story angle to use |
|---|---|
| Tell me about a time you dealt with a model failure in production | Technical crisis → root cause analysis → systematic fix → monitoring improvement |
| How did you influence a team without authority? | Data quality issue → built dashboard to show impact → stakeholders aligned → fixed pipeline |
| Describe a time you had to make a decision with incomplete data | Launched model with limited data → A/B test → iterated based on results |
| Tell me about your biggest technical failure | Data leakage bug → inflated metrics → caught in production → prevention system built |
| How did you prioritize competing projects? | Impact × effort matrix → stakeholder alignment → delivered highest value first |
Topic 05
Mock Interview Strategy
- Think out loud: Interviewers can't read your mind. Narrate your reasoning, even when stuck
- "I don't know" is fine: "I'm not certain, but my intuition is... Let me reason through it..." shows better thinking than a wrong confident answer
- Ask clarifying questions: "Are we optimizing for latency or throughput?" "What scale are we talking about?" This shows senior thinking
- Start simple: Always propose the simplest solution first, then discuss improvements. Interviewers want to see your thought process, not just the final answer
- Check your assumptions: "I'm assuming [X] — is that correct?" prevents going down the wrong path for 20 minutes
Common mistakes
- Jumping to complex models before establishing a baseline
- Ignoring evaluation metrics and class imbalance
- Not discussing tradeoffs (accuracy vs. latency, precision vs. recall)
- Forgetting to mention monitoring and retraining in system design
- Giving textbook answers without connecting to real experience
Topic 06
30-Day Interview Prep Plan
| Week | Focus | Daily activities |
|---|---|---|
| Week 1 | ML Fundamentals | Study bias-variance, regularization, evaluation metrics. Implement linear/logistic regression from scratch. 2 LeetCode problems/day |
| Week 2 | Deep Learning & NLP | Study neural networks, attention, transformers. Fine-tune a Hugging Face model. 2 LeetCode problems/day |
| Week 3 | System Design | Study recommendation systems, search ranking. Design 2 ML systems per day. Read Chip Huyen's blog posts |
| Week 4 | Mock Interviews | Daily mock interviews (ML theory + system design). Record yourself. Review and iterate. Practice behavioral stories |
Final checklist before interviews
- Can you explain gradient descent, backprop, and attention in 2 minutes each? (whiteboard practice)
- Have you memorized the confusion matrix and derived precision, recall, F1 from it?
- Can you design a recommendation system architecture from scratch in 45 minutes?
- Do you have 8 STAR stories ready, each under 3 minutes?
- Have you done at least 5 full mock interviews with a friend or on interviewing.io?
- Do you have questions ready to ask the interviewer? (Shows genuine interest)