Phase 1 — AI Literacy

ML Mental Model & How Models Learn

The minimum every SDE must know. These topics cover every AI rapid-fire question you'll face at product companies in 2025.

Supervised Learning · Bias-Variance Tradeoff · Precision & Recall · Data Leakage · Gradient Descent · Cross-Entropy
The ML Mental Model — Programming vs Learning
Traditional Programming

You write the rules → computer follows them

Input + Rules → Output
         ↑
  You write these rules manually

Example: if spam_words > 3: mark_spam()

Machine Learning

You give examples → computer learns the rules

Input + Output → Rules (model)
         ↑
  You provide labeled examples

Example: 10,000 emails (spam/not) → model learns spam rules

The model IS the learned rules — it's a function f(input) → output whose parameters were found by looking at examples, not written by hand.
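The contrast can be sketched in a few lines of Python. The spam-word counts, labels, and brute-force threshold search below are made up purely for illustration:

```python
# Traditional programming: a human writes the rule.
def rule_based(spam_word_count):
    return spam_word_count > 3  # threshold chosen by hand

# Machine learning: the "rule" (here, a threshold) is found from labeled examples.
def learn_threshold(examples):
    """examples: list of (spam_word_count, is_spam) pairs."""
    # Pick the threshold that classifies the training examples best.
    def accuracy(t):
        return sum((count > t) == label for count, label in examples)
    return max(range(0, 11), key=accuracy)

train = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
t = learn_threshold(train)            # the model IS this learned parameter
model = lambda count: count > t
```

The hand-written rule hard-codes `> 3`; the learned one discovers an equivalent boundary from the labeled examples alone.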
The 4 Learning Paradigms CRITICAL
| Paradigm | Data Type | How It Works | Key Example |
|---|---|---|---|
| Supervised | Labeled (input + correct answer) | Model learns mapping from input to known output | Spam detection, price prediction, image classification |
| Unsupervised | Unlabeled (input only) | Find hidden structure in data without labels | Customer segmentation, anomaly detection, topic modeling |
| Self-supervised | Unlabeled (creates its own labels) | Generate labels from the data itself — no humans needed | How GPT is trained: predict next word given all previous words |
| Reinforcement | Rewards from environment | Agent learns by taking actions and getting rewards/penalties | Game-playing AI, RLHF (how ChatGPT was made helpful) |
Interview tip: When asked "how is GPT trained?" — say "self-supervised learning on next-token prediction." The model is trained to predict the next word given all preceding words, on trillions of tokens. No human labels needed. From this single task it learns grammar, facts, reasoning, code, and instruction following.
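Self-supervised label creation is easy to see in code: the (context, next token) training pairs below come entirely from the raw text, with no human annotation. A toy sketch, not the real GPT pipeline:

```python
# Turn raw text into (context, next_token) training pairs. The labels
# come from the data itself -- that is what "self-supervised" means.
def next_token_pairs(text):
    tokens = text.split()  # real models use subword tokenizers, not split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the cat sat on the mat")
# e.g. (["the", "cat", "sat"], "on") -- predict "on" from the prefix
```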
Bias-Variance Tradeoff & Overfitting CRITICAL
High Bias (Underfitting): Model too simple → wrong on training AND test data. Needs more complexity or more features.
High Variance (Overfitting): Model too complex → memorizes training data → fails on new data. Training accuracy ≫ test accuracy.
Sweet Spot: Low bias + low variance = generalizes well. This is always the goal.
Overfitting Signals
  • Training accuracy = 99%, test accuracy = 72%
  • Validation loss increases while training loss decreases
  • Model memorizes noise in training data
Fixes for Overfitting
  • More data — almost always the best fix
  • Simpler model — fewer parameters
  • L1/L2 regularization — penalize large weights
  • Dropout — randomly turn off neurons
  • Early stopping — stop when val loss rises
Classic interview Q: "Training loss is 0.02 but test loss is 2.8. What's happening and how do you fix it?"
Answer: Severe overfitting. The model memorized the training data. First try: collect more data. Then try regularization (L2 or dropout). Check if features are truly informative — noisy features contribute to overfitting. Consider a simpler model architecture.
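The signals above can be reproduced with a toy experiment: fit a simple and a very flexible polynomial to the same noisy data and compare train vs. test error. A numpy sketch; the sine data and polynomial degrees are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 32)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=32)  # noisy sine

x_train, y_train = x[::2], y[::2]    # 16 points for training
x_test, y_test = x[1::2], y[1::2]    # 16 interleaved held-out points

def errors(deg):
    coeffs = np.polyfit(x_train, y_train, deg)        # fit polynomial
    pred = lambda xs: np.polyval(coeffs, xs)
    return (np.mean((pred(x_train) - y_train) ** 2),  # train MSE
            np.mean((pred(x_test) - y_test) ** 2))    # test MSE

train3, test3 = errors(3)     # simple model: similar train/test error
train15, test15 = errors(15)  # complex model: near-zero train error,
                              # much worse test error (overfitting)
```

The degree-15 fit threads through every noisy training point, so its training error collapses while its held-out error blows up, exactly the "training loss 0.02, test loss 2.8" pattern.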
Train / Validation / Test Split & Data Leakage CRITICAL
Training Set (~70%)

Model learns parameters (weights) from this. Model sees it many times during training.

Validation Set (~15%)

Tune hyperparameters, detect overfitting early. Never adjust model weights based on this.

Test Set (~15%)

Final evaluation — touch this ONLY ONCE, at the very end. Never use it to make decisions.

Data Leakage — The #1 Production Killer

What it is: when future or test-set information leaks into training, making evaluation look better than real-world performance.

Classic example: You normalize features by computing the mean/std on the full dataset before splitting. The mean you computed includes test data → the test set is no longer truly "unseen" → the model was trained with knowledge of the test samples → your reported 95% accuracy is inflated. Real-world accuracy will be lower.

Fix: Always split first, then fit any preprocessing (normalization, imputation, encoding) on training data only, then apply to val/test.
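In code, the fix is mechanical: split first, then fit preprocessing statistics on the training split only. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last value is a test outlier

# Split FIRST ...
train, test = data[:4], data[4:]

# ... then fit preprocessing on the training split only.
mean, std = train.mean(), train.std()

train_scaled = (train - mean) / std   # stats learned from train
test_scaled = (test - mean) / std     # same stats APPLIED to test

# The leaky version computes statistics on the full dataset: the test
# outlier (100.0) shifts the values every training sample sees.
leaky_mean = data.mean()              # contaminated by the test point
```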
Evaluation Metrics — When to Use What CRITICAL
Confusion Matrix
|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | TP          | FN          |
| Actual − | FP          | TN          |
The 4 Core Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
| Metric | When to Use | Why | Example Task |
|---|---|---|---|
| Accuracy | Only on balanced datasets | Meaningless on imbalanced data | MNIST digit classification |
| Precision | False positives are costly | Focus: don't annoy with wrong positives | Spam filter (don't block real email) |
| Recall | False negatives are costly | Focus: never miss a real positive | Cancer detection, fraud detection |
| F1 | Need balance of both | Harmonic mean punishes extreme imbalance | Named entity recognition, most NLP |
| AUC-ROC | Ranking models, threshold-free comparison | Measures discrimination ability across all thresholds | Risk scoring, ad CTR prediction |
Classic trap question: "Fraud model has 99% accuracy — is it good?"
Answer: NO. If 1% of transactions are fraud, predicting "not fraud" always gives 99% accuracy but catches zero fraud. You need high Recall for fraud detection — don't miss any fraudulent transaction. The accuracy number is completely misleading here.
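The trap is easy to verify with a constant "predict not-fraud" baseline (pure Python, synthetic labels):

```python
# 1000 transactions, 1% fraud; the "model" always predicts not-fraud.
y_true = [1] * 10 + [0] * 990          # 1 = fraud
y_pred = [0] * 1000                    # always "not fraud"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)       # 0.99 -- looks great
recall = tp / (tp + fn)                # 0.0  -- catches zero fraud
```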
Gradient Descent — How Models Learn (Intuition)

The Loop

  1. Model makes a prediction with current weights
  2. Compare to correct answer → compute loss (error)
  3. Gradient = direction of steepest increase in loss
  4. Update weights in the opposite direction (downhill)
  5. Repeat millions of times → loss decreases → model improves
w = w − α × (∂L/∂w)    (α = learning rate)
Learning Rate
Too large (α = 0.1): Overshoots the minimum; the model diverges or oscillates.
Too small (α = 0.00001): Trains too slowly and may get stuck in shallow local minima.
Just right (α ≈ 0.001): Converges smoothly to low loss.
The Adam optimizer adapts the effective step size per parameter automatically. It's the default for almost every modern model, and its standard base rate (α = 0.001) rarely needs hand-tuning.
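The loop above, written out for the simplest possible model, fitting y = w·x by hand (pure Python; the data and hyperparameters are made up):

```python
# Fit y = w * x to data generated with true slope 3, via gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w, alpha = 0.0, 0.01                 # initial weight, learning rate
for _ in range(1000):
    # MSE loss L = mean((w*x - y)^2); its gradient dL/dw:
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= alpha * grad                # step OPPOSITE the gradient (downhill)
# w converges to ~3.0
```

Each iteration is steps 1–4 of the loop: predict (`w * x`), compare (`w*x - y`), compute the gradient, step downhill.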
Loss Functions — Cross-Entropy vs MSE
Mean Squared Error (MSE)
L = (1/n) × Σ (y_pred − y_true)²

Use for regression — predicting continuous values.

Examples: house price, stock prediction, age estimation.

Penalizes large errors more than small ones (squared term). Easy to understand and interpret.
Binary Cross-Entropy
L = −[y·log(p) + (1−y)·log(1−p)]

Use for classification — predicting class labels or probabilities.

Examples: spam detection, sentiment analysis, churn prediction.

Why NOT MSE for classification? MSE gradients through sigmoid become flat near 0 and 1 — the model stops learning. Cross-entropy gradients flow cleanly through sigmoid.
Interview Q: "Why do we use cross-entropy for classification and not MSE?"
Answer: Two reasons. (1) Class labels (0 and 1) have no magnitude relationship — MSE doesn't make sense semantically for classes. (2) Cross-entropy gives much better gradient flow through sigmoid activations. MSE with sigmoid creates vanishing gradients near 0 and 1, making the model slow to train or unable to learn at all.
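The gradient claim is easy to check numerically. For p = σ(z) and target y, dL/dz works out to (p − y) for BCE but 2(p − y)·p·(1 − p) for MSE, and the extra p(1 − p) factor vanishes when the sigmoid saturates:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, y = 6.0, 0.0            # confidently WRONG prediction: p ~ 0.9975
p = sigmoid(z)

bce_grad = p - y                         # ~1.0: strong learning signal
mse_grad = 2 * (p - y) * p * (1 - p)     # ~0.005: almost no signal

# BCE's gradient is ~200x larger exactly when the model most needs correcting.
```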
Quick Reference — Key Formulas

Metrics

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
Accuracy = (TP+TN) / total

Loss Functions

MSE = (1/n)Σ(ŷ−y)²
CE = −Σ y·log(ŷ)
BCE = −[y·log(p)+(1−y)·log(1−p)]

Regularization

L2: loss + λ·Σwᵢ²
L1: loss + λ·Σ|wᵢ|

L1 can zero out weights (feature selection). L2 shrinks all weights. Both penalize large weights to reduce overfitting.
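The difference shows up in a single update step: an L2 gradient step scales every weight by the same factor, while the L1 proximal step (soft-thresholding) subtracts a constant and clips small weights to exactly zero. A library-free sketch; λ and the weights are arbitrary:

```python
lam = 0.1
weights = [0.05, -0.03, 2.0, -1.5]

# L2 step: shrink every weight by the same factor -- none become zero.
l2 = [w * (1 - lam) for w in weights]

# L1 step (soft-thresholding): small weights are clipped to exactly zero.
def soft_threshold(w, lam):
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

l1 = [soft_threshold(w, lam) for w in weights]
# l1 -> [0.0, 0.0, ~1.9, ~-1.4]: built-in feature selection
```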