Classical ML algorithms still dominate production systems for tabular data, which makes up the bulk of real-world ML work. Data scientists across big tech and fintech reach for these algorithms daily, and nearly every ML interview starts here.

Topic 01

Bias-Variance Tradeoff

ANALOGY

The archer analogy

Imagine four archers shooting at a target:

  • High bias, low variance: Arrows cluster tightly, but far from the bullseye — consistently wrong. (Underfitting)
  • Low bias, high variance: Arrows scattered widely but centered on the bullseye — right on average, rarely on target. (Overfitting)
  • High bias, high variance: The worst — spread out AND off-center.
  • Low bias, low variance: Tight cluster at the bullseye — what we want.

The archer analogy: bias = systematic error, variance = inconsistency

Total Error = Bias² + Variance + Irreducible Noise
  • High bias → underfitting: Model too simple (linear model on non-linear data). Fix: more model complexity, more features
  • High variance → overfitting: Model memorizes training data. Fix: regularization, more data, early stopping, dropout
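The decomposition above can be checked numerically. Below is a minimal pure-Python Monte Carlo sketch; the quadratic target, noise level, and mean-only model are illustrative assumptions, not part of any standard library. A deliberately too-simple model (predict the training-set mean, ignore x) is refit on many resampled datasets, and bias² + variance + noise is compared against its measured squared error at a test point.

```python
import random

random.seed(0)

def true_f(x):
    return x * x  # assumed ground-truth function, for illustration only

NOISE_SD = 0.5
X_TEST = 1.0
N_TRIALS = 20000

def sample_dataset(n=20):
    xs = [random.uniform(-2, 2) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, NOISE_SD) for x in xs]
    return xs, ys

# "High bias" model: always predict the mean of the training targets,
# ignoring x entirely. Refitting it on many resampled datasets lets us
# measure its systematic offset (bias) and spread (variance) at X_TEST.
preds = []
for _ in range(N_TRIALS):
    xs, ys = sample_dataset()
    preds.append(sum(ys) / len(ys))

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - true_f(X_TEST)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
noise = NOISE_SD ** 2

# Measured expected squared error on a fresh noisy observation at X_TEST
mse = sum((p - (true_f(X_TEST) + random.gauss(0, NOISE_SD))) ** 2
          for p in preds) / len(preds)

print(f"bias^2={bias_sq:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f} vs measured MSE={mse:.3f}")
```

The two printed numbers agree up to Monte Carlo error, which is the decomposition in action: the mean-only model's error is mostly bias, with a small variance term from the finite training sample.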

Topic 02

Linear & Logistic Regression

LINEAR

Linear Regression: fitting a line

Linear regression predicts a continuous value by finding the best-fit line through your data. "Best fit" means minimizing the mean squared error (MSE) — the average squared difference between predictions and targets.

ŷ = w₁x₁ + w₂x₂ + ... + b    |    Loss = (1/n)Σ(y - ŷ)²
  • Assumes: Linear relationship, no multicollinearity, homoscedastic residuals
  • Solution: Closed-form (Normal Equation) for small datasets, gradient descent for large
  • Regularization: L2 (Ridge) penalizes large weights, L1 (Lasso) can zero out features
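The gradient-descent route mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy (one feature, fixed learning rate, noise-free data), not a production solver:

```python
# Minimal 1-feature linear regression trained by batch gradient descent
# on the MSE loss: Loss = (1/n) * sum((w*x + b - y)^2)
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the MSE with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noise-free data from y = 2x + 1, so the fit should recover w≈2, b≈1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = fit_linear(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Adding an L2 penalty would only change the gradients (an extra `2 * alpha * w` term on `grad_w`), which is why Ridge is such a small step from plain least squares.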
LOGISTIC

Logistic Regression: why sigmoid?

For binary classification, we need output between 0 and 1 (a probability). The sigmoid function squashes any real number to (0, 1):

σ(z) = 1 / (1 + e⁻ᶻ)    |    z = w·x + b
Key Insight

Logistic regression predicts log-odds, not probability directly. "Log-odds of 0" → 50% probability. "Log-odds of 2" → 88%. Despite the name, it's a classification algorithm, not regression. The decision boundary is where σ(z) = 0.5, i.e., z = 0.
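The log-odds-to-probability mapping is easy to verify directly. A small sketch using only the standard library:

```python
import math

def sigmoid(z):
    """Squash a log-odds value z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5 -> the decision boundary, z = 0
print(sigmoid(2))   # ~0.88, matching the "log-odds of 2" example
print(sigmoid(-2))  # ~0.12, symmetric around 0.5

def logit(p):
    """The inverse: probability -> log-odds."""
    return math.log(p / (1 - p))

print(logit(0.5))   # 0.0
```

Because `sigmoid` is monotonic, thresholding the probability at 0.5 is exactly the same as thresholding z = w·x + b at 0, which is why the decision boundary is linear in x.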

Topic 03

Decision Trees & Random Forests

TREES

How decision trees make splits

A decision tree asks a sequence of yes/no questions to classify a data point. Each split is chosen to maximize information gain (how much uncertainty is reduced by the split).

  • Gini Impurity: Measures how often a randomly chosen element would be incorrectly classified. Gini = 1 - Σpᵢ²
  • Information Gain: Entropy before split - weighted entropy after split
  • Stopping criteria: Max depth, min samples per leaf, min information gain
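Both split criteria above are a few lines each. A pure-Python sketch (the toy labels are illustrative):

```python
import math
from collections import Counter

def gini(labels):
    """Gini = 1 - sum(p_i^2): chance a random element is misclassified
    if labeled according to the node's class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy before the split minus weighted entropy after it."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

print(gini(["a", "a", "a", "a"]))  # 0.0: a pure node
print(gini(["a", "a", "b", "b"]))  # 0.5: maximally mixed with 2 classes

# A perfect split separates the two classes entirely, gaining 1 full bit:
parent = ["a", "a", "b", "b"]
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # 1.0
```

A tree learner simply evaluates candidate splits with one of these functions and keeps the split with the lowest impurity (or highest gain).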
The overfit problem

A fully-grown decision tree memorizes the training data perfectly (it can have one leaf per training sample). Always regularize with max_depth or min_samples_leaf; a tree of depth 2-4 often generalizes better than one of depth 20.

FORESTS

Random Forest: wisdom of crowds

Random forests build many decision trees and average their predictions. Each tree is trained on a random subset of data (bagging) and a random subset of features at each split. The insight: many imperfect trees that disagree with each other make better predictions than one perfect tree that overfits.

Real numbers

A single decision tree might achieve 85% accuracy. A random forest of 500 trees often gets to 93-96% on the same dataset — because errors made by individual trees cancel out when averaged. This is ensemble learning.
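The "errors cancel out" claim can be made precise for the idealized case of independent trees: the probability that a majority of n classifiers, each correct with probability p, votes correctly. Real trees are correlated because they see overlapping data, so actual gains are smaller — which is exactly why random forests subsample features, to decorrelate the trees. A quick sketch:

```python
from math import comb

def majority_accuracy(n_trees, p):
    """P(majority vote is correct) for n independent binary classifiers
    (odd n), each correct with probability p. Idealized assumption:
    real forest trees are NOT independent."""
    k_needed = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(k_needed, n_trees + 1))

# A single weak tree at 60% accuracy...
print(majority_accuracy(1, 0.60))
# ...versus a 501-tree ensemble of such trees (independence assumed):
print(majority_accuracy(501, 0.60))  # well above 0.99
```

The jump from 0.60 to near-certainty is Condorcet's jury theorem in miniature: as long as each voter is better than chance and errors are independent, adding voters drives the ensemble toward perfect accuracy.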

Topic 04

Gradient Boosting: XGBoost & LightGBM

INTUITION

Boosting = fixing your mistakes

Imagine you're studying for an exam. After each practice test, you identify the questions you got wrong and focus extra time on those. Boosting does the same: each new tree focuses on the residual errors of the previous ensemble.

  1. Train a simple tree → make predictions → compute residuals (errors)
  2. Train next tree to predict the residuals
  3. Add this tree's predictions (scaled by learning rate) to the ensemble
  4. Repeat 100-1000 times
F_t(x) = F_{t-1}(x) + η × h_t(x)   (η = learning rate)
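The four steps can be sketched end to end in plain Python. Everything here — the stump learner, the toy data, the round count — is an illustrative assumption, not how XGBoost or LightGBM is actually implemented:

```python
# Toy gradient boosting for regression with squared loss: each round fits
# a depth-1 stump to the current residuals, then adds it scaled by eta.

def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=150, eta=0.1):
    f0 = sum(ys) / len(ys)            # F_0: constant prediction (the mean)
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]          # step 1
        stump = fit_stump(xs, residuals)                        # step 2
        stumps.append(stump)
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]  # step 3
    return lambda x: f0 + sum(eta * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 1, 4, 4]               # a step-like, non-linear target
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Note the role of the learning rate η: each stump contributes only a fraction of its correction, so many small steps are taken instead of a few large ones, which is the main overfitting control in boosting.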
Feature              | XGBoost                   | LightGBM           | CatBoost
Split strategy       | Level-wise                | Leaf-wise (faster) | Symmetric trees
Training speed       | Fast                      | Fastest            | Slower
Categorical features | Manual encoding needed    | Built-in support   | Best-in-class
Best for             | Competitions, medium data | Large datasets     | High-cardinality categoricals
When to use XGBoost

XGBoost and LightGBM dominate on tabular data. For any structured-data problem, try them before reaching for a neural network. They win Kaggle competitions because they handle mixed feature types and missing values natively and are extremely sample-efficient.

Topic 05

Support Vector Machines

SVM finds the maximum-margin hyperplane that separates two classes — the widest "road" between them. The data points closest to the boundary are called support vectors.

The kernel trick is SVMs' superpower: it implicitly maps data to a higher-dimensional space where a linear boundary exists. Common kernels: RBF (Gaussian), polynomial, linear.
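The intuition behind the kernel trick can be shown with an explicit feature map (real kernels like RBF never compute this map; they evaluate the inner products directly, which is the whole trick). The concentric-circles data below is a standard illustrative example, not from any particular library:

```python
import math

# Two concentric circles: not linearly separable in 2-D...
def circle_points(radius, n=8):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = circle_points(1.0)   # class A
outer = circle_points(3.0)   # class B

# ...but after mapping phi(x, y) = (x, y, x^2 + y^2), the third
# coordinate alone separates them with a plane.
def phi(p):
    x, y = p
    return (x, y, x * x + y * y)

inner_z = [phi(p)[2] for p in inner]
outer_z = [phi(p)[2] for p in outer]
print(max(inner_z), min(outer_z))  # ~1.0 vs ~9.0: the plane z = 5 works
```

A polynomial kernel corresponds to a map of this flavor; the RBF kernel corresponds to an infinite-dimensional one, which is why it can carve very flexible boundaries.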

When NOT to use SVM

SVM training scales between O(n²) and O(n³) in the number of samples, which makes it impractical beyond roughly 100k samples. For larger datasets, use logistic regression or gradient boosting instead.

Topic 06

Clustering: K-Means, DBSCAN & Hierarchical

K-MEANS

Centroid attraction

K-Means works like this: place k centroids randomly, assign each point to its nearest centroid, move centroids to the mean of their assigned points, repeat until convergence. The intuition: points cluster around their nearest "center of gravity."

  • Choose k: Elbow method (plot inertia vs k, pick the "elbow"), silhouette score
  • Limitation: Assumes spherical clusters of equal size. Fails on elongated or irregular shapes
  • DBSCAN: Finds clusters of arbitrary shape based on density. Better for noise/outliers
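The assign-then-update loop described above fits in a short pure-Python sketch. The blob data and the farthest-point initialization (a simple stand-in for k-means++ seeding) are illustrative assumptions:

```python
import random

random.seed(1)

def init_centroids(points, k):
    # Farthest-point heuristic: spreads initial centroids apart,
    # avoiding the classic bad-init failure of plain random seeding.
    centroids = [points[0]]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids))
        centroids.append(far)
    return centroids

def kmeans(points, k, iters=100):
    centroids = init_centroids(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0])))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids

# Two well-separated blobs around (0, 0) and (10, 10)
pts = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)]
       + [(random.gauss(10, 0.5), random.gauss(10, 0.5)) for _ in range(30)])
centers = sorted(kmeans(pts, 2))
print([(round(cx, 1), round(cy, 1)) for cx, cy in centers])
```

On elongated or ring-shaped data this same loop converges happily to a bad answer, which is the spherical-clusters limitation the list above warns about.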

Topic 07

Feature Engineering: Turning Raw Data into Signal

Feature engineering is the art of creating new input variables from raw data that make patterns easier for the model to find. This is often where the biggest performance gains come from — better features beat better algorithms.

Raw feature          | Engineered features                                      | Why it helps
timestamp            | hour, day_of_week, is_weekend, days_since_last_purchase | Captures temporal patterns
user_id + item_id    | user_item_interaction_count, item_popularity_rank       | Captures behavioral signals
latitude + longitude | distance_to_city_center, neighborhood_cluster           | Captures spatial patterns
text review          | sentiment_score, word_count, exclamation_count          | Extracts semantic features
price + quantity     | total_value, price_per_unit, discount_percentage        | Captures ratios and interactions
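The timestamp row of the table is concrete enough to sketch. The function name and feature dictionary below are hypothetical, purely for illustration:

```python
from datetime import datetime, timezone

def timestamp_features(ts, last_purchase_ts=None):
    """Derive temporal features from a Unix timestamp (hypothetical helper)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    feats = {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),      # 0 = Monday ... 6 = Sunday
        "is_weekend": dt.weekday() >= 5,
    }
    if last_purchase_ts is not None:
        feats["days_since_last_purchase"] = (ts - last_purchase_ts) / 86400
    return feats

# 2024-01-06 was a Saturday
ts = datetime(2024, 1, 6, 15, 30, tzinfo=timezone.utc).timestamp()
print(timestamp_features(ts, last_purchase_ts=ts - 3 * 86400))
```

A raw timestamp is a near-useless monotonic number to a tree or linear model; the derived columns expose the periodic and recency structure the model can actually split or weight on.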

Topic 08

Model Evaluation: Cross-Validation & Metrics

GOLDEN RULE

Never evaluate on training data

If you evaluate a model on the same data it trained on, it will look perfect — because it's memorized it. Always hold out test data. Use k-fold cross-validation for small datasets: split data into k folds, train on k-1, evaluate on 1, rotate, average.
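The fold rotation is simple to write out. A minimal index-level sketch (no shuffling or stratification, which a real split would usually add):

```python
# Generate (train_indices, test_indices) pairs for k-fold cross-validation:
# each of the k folds serves once as the held-out test set.
def kfold_indices(n, k):
    # Distribute n samples across k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n=10, k=5))
for train_idx, test_idx in folds:
    print(test_idx)
# Every sample lands in exactly one test fold; train and test never overlap.
```

Averaging the k held-out scores gives a far less noisy estimate than a single train/test split, at the cost of training the model k times.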

Metric    | Use when                                            | Formula
Accuracy  | Balanced classes, all errors equal                  | (TP+TN)/(TP+TN+FP+FN)
Precision | False positives are costly (spam filter)            | TP/(TP+FP)
Recall    | False negatives are costly (cancer detection)       | TP/(TP+FN)
F1 Score  | Imbalanced classes, balance of precision and recall | 2×P×R/(P+R)
AUC-ROC   | Ranking quality, threshold-independent              | Area under ROC curve
RMSE      | Regression, penalizes large errors more             | √(Σ(y-ŷ)²/n)
Interview question

Q: Your dataset has 1% fraud, 99% legit. Accuracy is 99%. Is the model good?
A: No — a model that predicts "legit" for everything achieves 99% accuracy. Use precision, recall, F1, or AUC-ROC on the minority class. Always look at the confusion matrix first.
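The interview scenario can be made concrete with the formulas from the table. Here 1000 samples with 1% fraud are scored against an "always predict legit" baseline (the data and helper are illustrative):

```python
# Confusion-matrix metrics for binary labels (1 = fraud, 0 = legit)
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = [1] * 10 + [0] * 990   # 1% fraud
y_pred = [0] * 1000             # "always legit" baseline
acc, prec, rec, f1 = metrics(y_true, y_pred)
print(f"accuracy={acc:.2%} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Accuracy comes out at 99% while recall on the fraud class is exactly zero: the model catches no fraud at all, which is precisely why the minority-class metrics (and the confusion matrix) must be inspected first.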

Classical ML Decision Framework
  • Tabular data, interpretability needed → Logistic Regression or Decision Tree
  • Tabular data, best accuracy → XGBoost / LightGBM
  • High-dimensional text/image features → SVM with RBF kernel (small data) or neural net (large)
  • Finding natural groups in data → K-Means (spherical clusters) or DBSCAN (arbitrary shapes)
  • Reducing dimensionality → PCA (linear) or UMAP/t-SNE (non-linear, for visualization)