Phase 1 — AI Literacy

ML Mental Model & How Models Learn

The minimum every SDE must know. These topics cover every AI rapid-fire question you'll face at product companies in 2025.

Supervised Learning · Bias-Variance Tradeoff · Precision & Recall · Data Leakage · Gradient Descent · Cross-Entropy
The ML Mental Model — Programming vs Learning
Traditional Programming

You write the rules → computer follows them

Input + Rules → Output
         ↑
  You write these rules manually

Example: if spam_words > 3: mark_spam()

Machine Learning

You give examples → computer learns the rules

Input + Output → Rules (model)
         ↑
  You provide labeled examples

Example: 10,000 emails (spam/not) → model learns spam rules

The model IS the learned rules — it's a function f(input) → output whose parameters were found by looking at examples, not written by hand.
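The contrast can be sketched in a few lines of Python. The spam-word counts, labels, and brute-force threshold search below are made up purely for illustration:

```python
# Traditional programming: a human writes the rule.
def rule_based(spam_word_count):
    return spam_word_count > 3  # threshold chosen by hand

# Machine learning: the "rule" (here, a threshold) is found from labeled examples.
def learn_threshold(examples):
    """examples: list of (spam_word_count, is_spam) pairs."""
    # Pick the threshold that classifies the training examples best.
    def accuracy(t):
        return sum((count > t) == label for count, label in examples)
    return max(range(0, 11), key=accuracy)

train = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
t = learn_threshold(train)            # the model IS this learned parameter
model = lambda count: count > t
```

The hand-written rule hard-codes `> 3`; the learned one discovers an equivalent boundary from the labeled examples alone.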
The 4 Learning Paradigms CRITICAL
| Paradigm | Data Type | How It Works | Key Example |
|---|---|---|---|
| Supervised | Labeled (input + correct answer) | Model learns mapping from input to known output | Spam detection, price prediction, image classification |
| Unsupervised | Unlabeled (input only) | Find hidden structure in data without labels | Customer segmentation, anomaly detection, topic modeling |
| Self-supervised | Unlabeled (creates its own labels) | Generate labels from the data itself — no humans needed | How GPT is trained: predict next word given all previous words |
| Reinforcement | Rewards from environment | Agent learns by taking actions and getting rewards/penalties | Game-playing AI, RLHF (how ChatGPT was made helpful) |
Interview tip: When asked "how is GPT trained?" — say "self-supervised learning on next-token prediction." The model is trained to predict the next word given all preceding words, on trillions of tokens. No human labels needed. From this single task it learns grammar, facts, reasoning, code, and instruction following.
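Self-supervised label creation is easy to see in code: the (context, next token) training pairs below come entirely from the raw text, with no human annotation. A toy sketch, not the real GPT pipeline:

```python
# Turn raw text into (context, next_token) training pairs. The labels
# come from the data itself -- that is what "self-supervised" means.
def next_token_pairs(text):
    tokens = text.split()  # real models use subword tokenizers, not split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the cat sat on the mat")
# e.g. (["the", "cat", "sat"], "on") -- predict "on" from the prefix
```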
Bias-Variance Tradeoff & Overfitting CRITICAL
High Bias (Underfitting): Model too simple → wrong on training AND test data. Needs more complexity or more features.
High Variance (Overfitting): Model too complex → memorizes training data → fails on new data. Training accuracy ≫ test accuracy.
Sweet Spot: Low bias + low variance = generalizes well. This is always the goal.
Overfitting Signals
  • Training accuracy = 99%, test accuracy = 72%
  • Validation loss increases while training loss decreases
  • Model memorizes noise in training data
Fixes for Overfitting
  • More data — almost always the best fix
  • Simpler model — fewer parameters
  • L1/L2 regularization — penalize large weights
  • Dropout — randomly turn off neurons
  • Early stopping — stop when val loss rises
Classic interview Q: "Training loss is 0.02 but test loss is 2.8. What's happening and how do you fix it?"
Answer: Severe overfitting. The model memorized the training data. First try: collect more data. Then try regularization (L2 or dropout). Check if features are truly informative — noisy features contribute to overfitting. Consider a simpler model architecture.
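The signals above can be reproduced with a toy experiment: fit a simple and a very flexible polynomial to the same noisy data and compare train vs. test error. A numpy sketch; the sine data and polynomial degrees are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 32)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=32)  # noisy sine

x_train, y_train = x[::2], y[::2]    # 16 points for training
x_test, y_test = x[1::2], y[1::2]    # 16 interleaved held-out points

def errors(deg):
    coeffs = np.polyfit(x_train, y_train, deg)        # fit polynomial
    pred = lambda xs: np.polyval(coeffs, xs)
    return (np.mean((pred(x_train) - y_train) ** 2),  # train MSE
            np.mean((pred(x_test) - y_test) ** 2))    # test MSE

train3, test3 = errors(3)     # simple model: similar train/test error
train15, test15 = errors(15)  # complex model: near-zero train error,
                              # much worse test error (overfitting)
```

The degree-15 fit threads through every noisy training point, so its training error collapses while its held-out error blows up, exactly the "training loss 0.02, test loss 2.8" pattern.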
Train / Validation / Test Split & Data Leakage CRITICAL
Training Set (~70%)

Model learns parameters (weights) from this. Model sees it many times during training.

Validation Set (~15%)

Tune hyperparameters, detect overfitting early. Never adjust model weights based on this.

Test Set (~15%)

Final evaluation — touch this ONLY ONCE, at the very end. Never use it to make decisions.

Data Leakage — The #1 Production Killer

What it is: when future or test-set information leaks into training, making evaluation look better than real-world performance.

Classic example: You normalize features by computing the mean/std on the full dataset before splitting. The mean you computed includes test data → the test set is no longer truly "unseen" → the model was trained with knowledge of the test samples → your reported 95% accuracy is inflated. Real-world accuracy will be lower.

Fix: Always split first, then fit any preprocessing (normalization, imputation, encoding) on training data only, then apply to val/test.
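In code, the fix is mechanical: split first, then fit preprocessing statistics on the training split only. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last value is a test outlier

# Split FIRST ...
train, test = data[:4], data[4:]

# ... then fit preprocessing on the training split only.
mean, std = train.mean(), train.std()

train_scaled = (train - mean) / std   # stats learned from train
test_scaled = (test - mean) / std     # same stats APPLIED to test

# The leaky version computes statistics on the full dataset: the test
# outlier (100.0) shifts the values every training sample sees.
leaky_mean = data.mean()              # contaminated by the test point
```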
Evaluation Metrics — When to Use What CRITICAL
Confusion Matrix
|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | TP          | FN          |
| Actual − | FP          | TN          |
The 4 Core Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
| Metric | When to Use | Why | Example Task |
|---|---|---|---|
| Accuracy | Only on balanced datasets | Meaningless on imbalanced data | MNIST digit classification |
| Precision | False positives are costly | Focus: don't annoy with wrong positives | Spam filter (don't block real email) |
| Recall | False negatives are costly | Focus: never miss a real positive | Cancer detection, fraud detection |
| F1 | Need balance of both | Harmonic mean punishes extreme imbalance | Named entity recognition, most NLP |
| AUC-ROC | Ranking models, threshold-free comparison | Measures discrimination ability across all thresholds | Risk scoring, ad CTR prediction |
Classic trap question: "Fraud model has 99% accuracy — is it good?"
Answer: NO. If 1% of transactions are fraud, predicting "not fraud" always gives 99% accuracy but catches zero fraud. You need high Recall for fraud detection — don't miss any fraudulent transaction. The accuracy number is completely misleading here.
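The trap is easy to verify with a constant "predict not-fraud" baseline (pure Python, synthetic labels):

```python
# 1000 transactions, 1% fraud; the "model" always predicts not-fraud.
y_true = [1] * 10 + [0] * 990          # 1 = fraud
y_pred = [0] * 1000                    # always "not fraud"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)       # 0.99 -- looks great
recall = tp / (tp + fn)                # 0.0  -- catches zero fraud
```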
Gradient Descent — How Models Learn (Intuition)

The Loop

  1. Model makes a prediction with current weights
  2. Compare to correct answer → compute loss (error)
  3. Gradient = direction of steepest increase in loss
  4. Update weights in the opposite direction (downhill)
  5. Repeat millions of times → loss decreases → model improves
w = w − α × (∂L/∂w)    (α = learning rate)
Learning Rate
Too large (α = 0.1): Overshoots the minimum; the model diverges or oscillates.
Too small (α = 0.00001): Trains too slowly and may get stuck in shallow local minima.
Just right (α ≈ 0.001): Converges smoothly to low loss.
The Adam optimizer adapts the effective step size per parameter automatically. It's the default for almost every modern model, and its standard base rate (α = 0.001) rarely needs hand-tuning.
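The loop above, written out for the simplest possible model, fitting y = w·x by hand (pure Python; the data and hyperparameters are made up):

```python
# Fit y = w * x to data generated with true slope 3, via gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w, alpha = 0.0, 0.01                 # initial weight, learning rate
for _ in range(1000):
    # MSE loss L = mean((w*x - y)^2); its gradient dL/dw:
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= alpha * grad                # step OPPOSITE the gradient (downhill)
# w converges to ~3.0
```

Each iteration is steps 1–4 of the loop: predict (`w * x`), compare (`w*x - y`), compute the gradient, step downhill.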
Loss Functions — Cross-Entropy vs MSE
Mean Squared Error (MSE)
L = (1/n) × Σ (y_pred − y_true)²

Use for regression — predicting continuous values.

Examples: house price, stock prediction, age estimation.

Penalizes large errors more than small ones (squared term). Easy to understand and interpret.
Binary Cross-Entropy
L = −[y·log(p) + (1−y)·log(1−p)]

Use for classification — predicting class labels or probabilities.

Examples: spam detection, sentiment analysis, churn prediction.

Why NOT MSE for classification? MSE gradients through sigmoid become flat near 0 and 1 — the model stops learning. Cross-entropy gradients flow cleanly through sigmoid.
Interview Q: "Why do we use cross-entropy for classification and not MSE?"
Answer: Two reasons. (1) Class labels (0 and 1) have no magnitude relationship — MSE doesn't make sense semantically for classes. (2) Cross-entropy gives much better gradient flow through sigmoid activations. MSE with sigmoid creates vanishing gradients near 0 and 1, making the model slow to train or unable to learn at all.
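The gradient claim is easy to check numerically. For p = σ(z) and target y, dL/dz works out to (p − y) for BCE but 2(p − y)·p·(1 − p) for MSE, and the extra p(1 − p) factor vanishes when the sigmoid saturates:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, y = 6.0, 0.0            # confidently WRONG prediction: p ~ 0.9975
p = sigmoid(z)

bce_grad = p - y                         # ~1.0: strong learning signal
mse_grad = 2 * (p - y) * p * (1 - p)     # ~0.005: almost no signal

# BCE's gradient is ~200x larger exactly when the model most needs correcting.
```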
Quick Reference — Key Formulas

Metrics

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
Accuracy = (TP+TN) / total

Loss Functions

MSE = (1/n)Σ(ŷ−y)²
CE = −Σ y·log(ŷ)
BCE = −[y·log(p)+(1−y)·log(1−p)]

Regularization

L2: loss + λ·Σwᵢ²
L1: loss + λ·Σ|wᵢ|

L1 can zero out weights (feature selection). L2 shrinks all weights. Both penalize large weights to reduce overfitting.
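The difference shows up in a single update step: an L2 gradient step scales every weight by the same factor, while the L1 proximal step (soft-thresholding) subtracts a constant and clips small weights to exactly zero. A library-free sketch; λ and the weights are arbitrary:

```python
lam = 0.1
weights = [0.05, -0.03, 2.0, -1.5]

# L2 step: shrink every weight by the same factor -- none become zero.
l2 = [w * (1 - lam) for w in weights]

# L1 step (soft-thresholding): small weights are clipped to exactly zero.
def soft_threshold(w, lam):
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

l1 = [soft_threshold(w, lam) for w in weights]
# l1 -> [0.0, 0.0, ~1.9, ~-1.4]: built-in feature selection
```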