I remember the first time someone told me to "just use a pre-trained model." I had no idea what that meant. Pre-trained on what? Why does it work on my data? What happens when it gives wrong answers — and who's responsible?

If you've ever felt that way — like AI is a black box that works until it doesn't, and you don't know why — this blog is for you. By the end, you won't just be able to use ML. You'll be able to reason about it, debug it, and talk about it confidently in interviews.

Section 01

The Single Most Important Reframe

Every SDE comes into ML carrying one assumption that doesn't hold: that programming means writing logic. You define the rules. The computer executes them. Input goes in, output comes out, and you can trace every step.

Machine learning flips this completely.

The Vending Machine Analogy

Traditional programming: You build a vending machine. You write rules: "if button A2 → dispense Coke, if button B3 → dispense chips." You control every outcome.

Machine learning: You show the machine 10,000 examples of (button pressed → item dispensed). It figures out the pattern itself. You never wrote a single rule — the machine inferred them from data.

The output is not code. It's a function — a set of numerical weights — that approximates the relationship between inputs and outputs.

This is the fundamental shift. In ML, you are not writing rules. You are writing a function that learns rules from examples. Your job as an SDE is to: (1) give it the right examples, (2) define what "correct" means (the loss function), and (3) evaluate whether it learned the right thing.
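To make the reframe concrete, here's a toy contrast between the two modes. The scenario (shipping costs) and all names are made up for illustration; the "learning" is just a one-variable least-squares fit, the simplest possible example of inferring a rule from data:

```python
# Traditional programming: a human writes the rule.
def shipping_cost_rule(weight_kg: float) -> float:
    return 5.0 + 2.0 * weight_kg  # base fee + per-kg rate, decided by a person

# Machine learning: you supply examples, the program infers the rule.
examples = [(1.0, 7.0), (2.0, 9.0), (3.0, 11.0), (4.0, 13.0)]  # (weight, cost)

n = len(examples)
mean_x = sum(x for x, _ in examples) / n
mean_y = sum(y for _, y in examples) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in examples)
         / sum((x - mean_x) ** 2 for x, _ in examples))
intercept = mean_y - slope * mean_x

def shipping_cost_learned(weight_kg: float) -> float:
    return intercept + slope * weight_kg  # a rule nobody wrote by hand

print(shipping_cost_learned(10.0))  # 25.0, same as the hand-written rule
```

No one typed "+ 2.0 * weight_kg" into the learned version; the slope and intercept are the "weights" recovered from the examples.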

Why this matters for SDEs building AI features

When your ML feature gives wrong answers, the bug is not in a line of code you wrote. It's in the data, the objective, or the evaluation. Debugging ML means asking: "Did the model see enough examples of this case? Did we measure the right thing?" — not "which if-statement is wrong?"

Section 02

The 4 Learning Paradigms — And Where GPT Fits

Not all ML is the same. There are four fundamentally different ways a model can learn, and knowing which one you're working with changes everything about how you build and evaluate it.

Supervised Learning
Most Common

You provide labelled examples: (input, correct output). The model learns to map inputs to outputs.

Examples: Email spam classifier (input: email, label: spam/not), image recognition (input: photo, label: cat/dog), fraud detection.

Unsupervised Learning
No Labels

No labels. The model finds patterns and structure in the data on its own.

Examples: Customer segmentation (group users by behavior), anomaly detection (find the unusual transactions), topic modeling (group documents by theme).

Self-Supervised Learning
How GPT Works

The model creates its own labels from raw data. No human annotation needed.

GPT's task: Given text from the internet, predict the next word. The internet is the training set. Every sentence is both input and label. This is how LLMs are trained — no human labelled anything.
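You can see "the labels come for free" in a few lines. This is a word-level toy (real LLMs predict subword tokens, not words), but the principle is the same: every prefix of the text is an input, and the word that follows is its label.

```python
# Self-supervision: turn raw text into (input, label) pairs with zero annotation.
text = "the cat sat on the mat"
tokens = text.split()

# Each prefix is an input; the next word is the label it must predict.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... one free training example per position in the text
```

Scale this from one sentence to a crawl of the internet and you have GPT's pretraining dataset, built without a single human annotator.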

Reinforcement Learning
Reward-Driven

The model takes actions in an environment and receives rewards or penalties. It learns to maximize reward over time.

Examples: AlphaGo (win the game), ChatGPT RLHF (be more helpful — humans rate responses as rewards).

Why GPT isn't supervised learning

A common misconception: "GPT was trained on labelled data." Not quite. The pretraining step is self-supervised — predict the next token from raw internet text, no human labels. The fine-tuning step (RLHF) uses human preferences as a reward signal. Two different paradigms, stacked on top of each other. This is why GPT knows facts (from pretraining) but is also helpful and safe (from RLHF).

Section 03

Overfitting & Underfitting — The Two Ways Models Fail

Every model fails in one of two directions. Understanding which direction your model is failing tells you exactly what to fix.

The Exam Student Analogy

Underfitting is the student who studied the wrong things — they skimmed the textbook and showed up to the exam without understanding the material. They fail both the practice exam and the real one.

Overfitting is the student who memorized last year's exact exam paper. They ace the practice test. But the real exam has slightly different questions — and they fail completely, because they learned the specific answers, not the underlying concepts.

|                     | Underfitting                          | Overfitting                                       |
|---------------------|---------------------------------------|---------------------------------------------------|
| Training accuracy   | Low                                   | Very high                                         |
| Validation accuracy | Low                                   | Much lower than training                          |
| Cause               | Model too simple, not enough training | Model memorized training data                     |
| Fix                 | More complex model, more features, train longer | More data, dropout, regularization, simpler model |
| Production signal   | Bad on all inputs                     | Great on seen patterns, bad on new ones           |

In production, overfitting is the sneakier problem. Your model tests great on historical data. You ship it. Then user behavior shifts slightly — a new device type, a seasonal pattern, a new market — and the model collapses. It "learned" the quirks of your training set, not the underlying signal.
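The table's diagnostic logic fits in a small function. This is a sketch of the kind of check you might wire into a training pipeline; the function name and the thresholds are illustrative, not from any library:

```python
# A minimal overfitting/underfitting check from train vs. validation accuracy.
# Thresholds are illustrative; pick ones that match your task's baseline.
def diagnose(train_acc: float, val_acc: float, gap_threshold: float = 0.10) -> str:
    if train_acc < 0.70 and val_acc < 0.70:
        return "underfitting: bad everywhere; try a bigger model or more features"
    if train_acc - val_acc > gap_threshold:
        return "overfitting: large train/val gap; try more data or regularization"
    return "reasonable fit"

print(diagnose(0.99, 0.71))  # memorized the training set
print(diagnose(0.62, 0.60))  # too simple to capture the signal
```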

The production version of overfitting

At large tech companies, models often overfit to the data distribution at training time. When the world changes (new user cohorts, new products, new language), the model silently degrades. This is called distribution shift. The fix is continuous monitoring — alert when the model's predictions deviate from expected distributions, and retrain regularly.

Section 04

The Bias-Variance Tradeoff

Underfitting and overfitting have a formal name: the bias-variance tradeoff. It's one of the most fundamental concepts in ML, and it explains why there's no "perfect" model — only tradeoffs.

  • Bias is the error from wrong assumptions. A model with high bias is too simple — it misses important patterns. Like a map that only shows countries, no roads. Useful for some things, useless for others. High bias = underfitting.
  • Variance is sensitivity to fluctuations in training data. A model with high variance changes dramatically with small changes in training data — it's learned the noise, not the signal. High variance = overfitting.

The tradeoff: reducing bias (making the model more complex) tends to increase variance. Reducing variance (simplifying the model) tends to increase bias. You can't eliminate both simultaneously. The goal is to find the sweet spot where total error (bias² + variance + irreducible noise) is minimized.
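Here's a small simulation of all three regimes, using a deliberately trivial task (predict a constant signal buried in noise). The three "models" are caricatures built for the demo, not real estimators: one bakes in a wrong assumption, one memorizes the training noise, and one averages it away.

```python
import random

random.seed(0)
TRUE_VALUE, NOISE_STD = 3.0, 1.0

def sample(n):
    """Noisy observations of a constant signal, keyed by an integer x."""
    return [(x, TRUE_VALUE + random.gauss(0, NOISE_STD)) for x in range(n)]

train, test = sample(200), sample(200)

# High bias: wrong assumption baked in — always predict 0, ignore the data.
def biased(x): return 0.0

# High variance: memorize each training label exactly — learns the noise too.
memorized = dict(train)
def high_variance(x): return memorized[x]

# Balanced: predict the training mean — learns the signal, averages out noise.
mean_y = sum(y for _, y in train) / len(train)
def balanced(x): return mean_y

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(biased, test))         # ~10: bias² (9) + irreducible noise (1)
print(mse(high_variance, test))  # ~2: double the noise — it memorized one sample of it
print(mse(balanced, test))       # ~1: just the irreducible noise
```

The balanced model can't beat the noise floor of ~1.0, which is the "irreducible noise" term in the total-error decomposition: no model can.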

The Map Analogy

High bias (underfitting): A world map where every country is a single color. It shows the big picture but misses every road, city, and neighborhood. Wrong assumptions baked in.

High variance (overfitting): A map so detailed it memorizes every pothole from when it was surveyed in 2019. It's "accurate" — for 2019. Now half the roads are wrong.

Good model: A map detailed enough to navigate, updated regularly, and not so granular that minor road changes break it.

Section 05

Train / Val / Test Split — And the Silent Killer: Data Leakage

You split your data into three sets. This is not optional — it's the foundation of honest model evaluation.

| Split | Used For | Typical Size | The Rule |
|---|---|---|---|
| Training set | Training the model — the model sees this data and updates its weights | 70–80% | Model can see this as many times as needed |
| Validation set | Tuning hyperparameters, comparing models, early stopping | 10–15% | Model never trains on this — but you look at it to make decisions |
| Test set | Final evaluation — the "final exam" score you report | 10–15% | Touch it exactly once, after you've finalized everything |

The test set rule is sacred: you are not allowed to make any decisions based on the test set. The moment you do, you've contaminated it. Every time you look at test set performance and adjust your model accordingly, the test set is no longer an honest measure of real-world performance — it's become another training signal.
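A minimal stdlib sketch of the split (real projects typically reach for a library helper such as scikit-learn's `train_test_split`, and time-series data needs a chronological split instead, per the temporal-leakage warning below). The key details: shuffle exactly once, with a fixed seed, before carving:

```python
import random

def split(data, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then carve into train/val/test."""
    rows = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = split(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

The fixed seed matters: if the split changes between runs, yesterday's test examples silently become today's training examples, which is a form of contamination.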

Data Leakage — The Bug That Makes Your Model Look Amazing

Data leakage is when information from the future — or from outside what the model would know at prediction time — accidentally ends up in your training data. The model learns this leaked signal, gets artificially high accuracy, and then fails completely in production where the leak doesn't exist.

Real example of data leakage

A team built a fraud detection model. Training accuracy: 99%. Production: 60%. The bug: they included the transaction_reversed field in features. This field is only set after a transaction is reviewed and reversed — which is exactly what you're trying to predict. The model "learned" that if transaction_reversed=True, it's fraud. Of course it did — that's the outcome, not a feature.

Common leakage patterns to watch for:

  • Target leakage: A feature that's computed from or correlated with the label, and wouldn't be available at prediction time
  • Train-test contamination: Normalizing the entire dataset (including test) before splitting — the test set's statistics leak into training normalization
  • Temporal leakage: Using future data to predict the past (shuffling time-series data before splitting)
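The train-test contamination pattern above is worth seeing in code, because the wrong version looks completely innocent. This toy standardizer uses only the stdlib; a real pipeline would use a fitted scaler object, but the rule is identical: fit statistics on the training set only, then apply them everywhere.

```python
train = [2.0, 4.0, 6.0, 8.0]
test  = [10.0, 12.0]

# WRONG: mean/std computed over train + test — test statistics leak into training.
# RIGHT: fit the statistics on the training set only...
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

# ...then apply those same training statistics to both sets.
train_scaled = [(x - mean) / std for x in train]
test_scaled  = [(x - mean) / std for x in test]   # uses TRAIN stats, not its own

print(mean, round(std, 3))  # 5.0 2.236
```

Notice that the correctly scaled test values fall outside the trained range. That's honest: at prediction time the model really will see inputs your training statistics didn't anticipate.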

Section 06

Metrics That Actually Matter: Precision, Recall, and F1

"Accuracy" sounds like the right metric. It's usually wrong. Here's why: imagine a fraud detection model that labels every single transaction as "not fraud." In a dataset where 0.1% of transactions are fraud, this model has 99.9% accuracy. It is also completely useless.

The problem is class imbalance — the real-world signal you care about is rare. Accuracy rewards predicting the majority class. You need better metrics.

| Metric | Formula | Plain English |
|---|---|---|
| Precision | TP / (TP + FP) | Of all the times the model said "yes", how often was it right? Don't cry wolf. |
| Recall | TP / (TP + FN) | Of all the actual "yes" cases, how many did the model find? Don't miss any. |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall. Useful when both matter. |
| AUC-ROC | Area under ROC curve | How well can the model distinguish between classes at any threshold? 0.5 = random, 1.0 = perfect. |

When to Prioritize Which

Spam Filter vs Cancer Screening

Spam filter → prioritize precision. A false positive means a real email ends up in spam. That's bad — someone misses an important message. A false negative (spam gets through) is annoying but not catastrophic. You want high precision: when you say "spam", be sure.

Cancer screening → prioritize recall. A false negative means a patient with cancer is told they're fine. That's catastrophic. A false positive (healthy person gets further tests) is expensive but not fatal. You want high recall: catch every real case, even if some are false alarms.

Interview question

"You're building a content moderation system. Which metric do you optimize for?" — There's no single right answer. If the cost of showing harmful content is higher than the cost of incorrectly removing benign content, optimize for recall. State the tradeoff explicitly. Interviewers want to see that you reason through it, not that you memorize one answer.

Section 07

Loss Functions — Telling the Model What "Wrong" Means

Training a model is an optimization problem. The model starts with random weights. It makes a prediction. You measure how wrong it was. It adjusts its weights to be less wrong next time. The loss function is the measure of "how wrong."

Different tasks need different loss functions. Using the wrong one is a common bug — the model optimizes the wrong thing and gives useless outputs.

| Loss Function | Used For | What It Measures |
|---|---|---|
| MSE (Mean Squared Error) | Regression — predicting a number | Average squared difference between prediction and true value. Penalizes large errors heavily. |
| Cross-Entropy | Classification — predicting a category | How surprised the model is by the correct answer. If the model is 99% confident and right → low loss. 99% confident and wrong → very high loss. |
| Binary Cross-Entropy | Binary classification (yes/no) | Special case of cross-entropy for two classes. Used in spam detection, fraud detection. |

Here's the intuition for cross-entropy: imagine the model is a betting agent. For classification, it places bets on which class is correct. Cross-entropy measures how much money it loses based on its confidence and correctness. If it was 99% confident in the wrong answer, it loses a lot. If it was uncertain (50/50) and wrong, it loses less — it wasn't really betting heavily on the wrong answer.
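The betting intuition is literally the formula. For a single example, cross-entropy is just the negative log of the probability the model assigned to the true class:

```python
import math

def cross_entropy(p_true_class: float) -> float:
    """Loss for one example: -log of the probability given to the correct class."""
    return -math.log(p_true_class)

print(round(cross_entropy(0.99), 3))  # 0.01  — confident and right: tiny loss
print(round(cross_entropy(0.50), 3))  # 0.693 — uncertain: moderate loss
print(round(cross_entropy(0.01), 3))  # 4.605 — confident and wrong: huge loss
```

Note the asymmetry: going from 50% to 99% confidence in the right answer saves you a little loss, but being 99% confident in the wrong answer (so only 1% on the right one) costs enormously. That's what pushes models toward calibrated confidence.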

Section 08

Gradient Descent — How the Model Actually Learns

You now know what the model is trying to minimize (the loss). Gradient descent is the algorithm that does the minimizing. It's the engine of all of machine learning — used in linear regression, neural networks, and LLMs alike.

The Blindfolded Hiker

Imagine you're blindfolded on a hilly landscape. Your goal is to reach the lowest valley. You can't see the whole landscape — you can only feel the ground under your feet. Your strategy: at each step, feel which direction slopes downward most steeply, and take a small step in that direction. Repeat until you're not going downhill anymore.

That's gradient descent. The landscape is the loss function. Your position is the model's weights. The slope at your feet is the gradient. The step size is the learning rate.

Formally, at each training step:

# The gradient descent update rule
new_weights = old_weights - learning_rate * gradient_of_loss

# In code (conceptual):
for batch in training_data:
    predictions = model.forward(batch.inputs)
    loss = loss_fn(predictions, batch.labels)
    gradients = compute_gradients(loss)     # e.g. loss.backward() in PyTorch,
                                            # which stores grads on the parameters
    weights -= learning_rate * gradients    # update weights

The Learning Rate: The Hyperparameter That Breaks Everything

  • Too high: Steps are too big. You overshoot the valley, bounce around, never converge. Loss oscillates instead of decreasing.
  • Too low: Steps are tiny. Training takes forever. You might get stuck in a local minimum.
  • Just right: Loss decreases smoothly. Model converges.

Modern optimizers (Adam, AdamW) automatically adapt the learning rate for each parameter, which is why you rarely tune it manually anymore. But knowing why it matters helps you debug training runs that won't converge.
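All three failure modes are easy to reproduce on the simplest possible loss, loss(w) = w², whose gradient is 2w and whose minimum is at 0:

```python
def run(learning_rate: float, steps: int = 20, w: float = 1.0) -> float:
    """Run gradient descent on loss(w) = w^2 and return the final weight."""
    for _ in range(steps):
        w -= learning_rate * 2 * w   # gradient of w^2 is 2w
    return w

print(abs(run(0.1)))    # ~0.01: converges smoothly toward 0
print(abs(run(1.5)))    # ~1e6: each step doubles |w| — overshoot and divergence
print(abs(run(0.001)))  # ~0.96: technically converging, but barely moved
```

On this loss each update multiplies w by (1 − 2·lr), so anything with lr > 1 flips the sign and grows |w| every step. Real loss surfaces are not this tidy, but the oscillate/diverge vs. crawl tradeoff is exactly what you see in a real loss curve.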

Why gradient descent doesn't always find the best solution

Gradient descent finds a local minimum, not necessarily the global minimum. The loss landscape for a neural network has millions of dimensions and many local valleys. The good news: for deep neural networks, most local minima are "good enough" — in practice, stochastic gradient descent or an adaptive variant like Adam almost always converges to a useful solution. Saddle points (flat regions where the gradient is near zero) are the bigger practical problem, which is why momentum-based and adaptive optimizers like Adam work better than vanilla gradient descent.

Section 09

What This Means for Your Interview

These concepts show up constantly — not just in "ML engineer" interviews but in any SDE role where you're building AI features. Here's exactly what you should be able to do:

Questions You Can Now Answer

  • "What's the difference between supervised and unsupervised learning?" → One has labels, one finds patterns without labels. Bonus: mention self-supervised (GPT) and reinforcement learning.
  • "Our model performs great in testing but poorly in production. What could be wrong?" → Data leakage, distribution shift, overfitting to historical patterns, train/test contamination. Walk through each.
  • "How do you evaluate a classification model?" → Don't just say accuracy. Walk through precision, recall, F1, and AUC-ROC. Explain when each matters with a real example.
  • "What is gradient descent?" → Blindfolded hiker analogy. The loss function is the landscape, weights are your position, gradient is the slope, learning rate is step size. Then mention SGD vs Adam.
  • "What is overfitting and how do you fix it?" → Exam student who memorized last year's paper. Fix with: more data, regularization (L1/L2), dropout, simpler model, early stopping.

The One Concept That Unlocks Everything Else

If you internalize one thing from this blog, make it this: in ML, you are not writing rules — you are defining what "correct" means and showing examples. Every problem in ML comes back to: did the model see the right examples? Did we measure the right thing? Are we evaluating honestly?

With this mental model, you'll find that the rest of the AI course — transformers, RAG, agents, system design — all make more intuitive sense. These aren't magic black boxes. They're functions that learned from data, optimized for a loss, and evaluated on metrics. You understand all three now.

The question that separates good candidates

"Walk me through how you'd approach an ML problem from scratch." Most candidates describe a model architecture. Strong candidates start with: clarify the objective, define the success metric, understand the data, choose a simple baseline, evaluate honestly, then iterate. The mental model, not the model.