Contents
Classical ML algorithms still dominate production systems for tabular data, which accounts for the large majority of real-world ML work. Data scientists at big tech companies and fintechs reach for these algorithms daily, and nearly every ML interview starts here.
Topic 01
Bias-Variance Tradeoff
The archer analogy
Imagine four archers shooting at a target:
- High bias, low variance: Arrows cluster tightly, but far from the bullseye — consistently wrong. (Underfitting)
- Low bias, high variance: Arrows spread widely but are centered on the bullseye, right on average, unreliable shot to shot. (Overfitting)
- High bias, high variance: The worst — spread out AND off-center
- Low bias, low variance: Tight cluster at the bullseye — what we want
The archer analogy: bias = systematic error, variance = inconsistency
- High bias → underfitting: Model is too simple (e.g., a linear model on non-linear data). Fix: increase model complexity, add features
- High variance → overfitting: Model memorizes training data. Fix: regularization, more data, early stopping, dropout
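The tradeoff is easy to see empirically. A minimal sketch with scikit-learn and NumPy (assumed available), fitting polynomials of increasing degree to a noisy sine; the degrees and noise level are illustrative choices:

```python
# Compare under- and overfitting by varying model complexity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 200)  # noisy non-linear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

errors = {}
for degree in (1, 4, 15):  # too simple, about right, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                      mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree {degree:2d}  train MSE {errors[degree][0]:.3f}  test MSE {errors[degree][1]:.3f}")
```

Degree 1 is bad on both sets (bias); degree 15 drives training error far below degree 1 while gaining little or nothing on held-out data (variance).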
Topic 02
Linear & Logistic Regression
Linear Regression: fitting a line
Linear regression predicts a continuous value by finding the best-fit line through your data. "Best fit" means minimizing the sum of squared errors, equivalently the mean squared error (MSE).
- Assumes: Linear relationship, no multicollinearity, homoscedastic residuals
- Solution: Closed-form (Normal Equation) for small datasets, gradient descent for large
- Regularization: L2 (Ridge) penalizes large weights, L1 (Lasso) can zero out features
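The difference between the two penalties shows up directly in the coefficients. A sketch with scikit-learn on synthetic data where only 2 of 10 features matter (the alpha values are illustrative):

```python
# L2 shrinks all weights; L1 can drive irrelevant weights exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)  # features 2..9 are pure noise

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Ridge keeps all ten coefficients small but non-zero; Lasso zeroes out the noise features, acting as built-in feature selection.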
Logistic Regression: why sigmoid?
For binary classification, we need an output between 0 and 1 (a probability). The sigmoid function σ(z) = 1 / (1 + e⁻ᶻ) squashes any real number into (0, 1).
Logistic regression predicts log-odds, not probability directly. "Log-odds of 0" → 50% probability. "Log-odds of 2" → 88%. Despite the name, it's a classification algorithm, not regression. The decision boundary is where σ(z) = 0.5, i.e., z = 0.
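The log-odds figures above are just the sigmoid evaluated at z; a few lines verify them:

```python
# Map log-odds z to a probability with the sigmoid.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5, the decision boundary
print(round(sigmoid(2), 2))  # 0.88
```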
Topic 03
Decision Trees & Random Forests
How decision trees make splits
A decision tree asks a sequence of yes/no questions to classify a data point. Each split is chosen to maximize information gain (how much uncertainty is reduced by the split).
- Gini Impurity: Measures how often a randomly chosen element would be incorrectly classified. Gini = 1 - Σpᵢ²
- Information Gain: Entropy before split - weighted entropy after split
- Stopping criteria: Max depth, min samples per leaf, min information gain
A fully-grown decision tree memorizes the training data perfectly (it can have one leaf per training sample). Always use max_depth or min_samples_leaf to regularize. A tree with depth 2-4 is often more generalizable than depth 20.
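The memorization claim can be checked directly. A sketch with scikit-learn on synthetic data (dataset parameters are arbitrary):

```python
# An unrestricted tree memorizes training data; a shallow one generalizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    depth:", deep.get_depth(), " train:", deep.score(X_tr, y_tr),
      " test:", round(deep.score(X_te, y_te), 3))
print("shallow depth:", shallow.get_depth(), " train:", round(shallow.score(X_tr, y_tr), 3),
      " test:", round(shallow.score(X_te, y_te), 3))
```

The unrestricted tree hits 100% training accuracy (pure memorization) while the depth-3 tree trades a little training fit for a smaller train/test gap.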
Random Forest: wisdom of crowds
Random forests build many decision trees and average their predictions. Each tree is trained on a random subset of data (bagging) and a random subset of features at each split. The insight: many imperfect trees that disagree with each other make better predictions than one perfect tree that overfits.
A single decision tree might achieve 85% accuracy. A random forest of 500 trees often gets to 93-96% on the same dataset — because errors made by individual trees cancel out when averaged. This is ensemble learning.
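One way to see the effect is to compare cross-validated accuracy of a single tree against a forest on the same data; the exact gap depends on the dataset, so this synthetic setup is only illustrative:

```python
# Averaging many decorrelated trees beats one unconstrained tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                             X, y, cv=5).mean()
print(f"single tree: {tree_acc:.3f}  random forest: {forest_acc:.3f}")
```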
Topic 04
Gradient Boosting: XGBoost & LightGBM
Boosting = fixing your mistakes
Imagine you're studying for an exam. After each practice test, you identify the questions you got wrong and focus extra time on those. Boosting does the same: each new tree focuses on the residual errors of the previous ensemble.
- Train a simple tree → make predictions → compute residuals (errors)
- Train next tree to predict the residuals
- Add this tree's predictions (scaled by learning rate) to the ensemble
- Repeat 100-1000 times
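The four steps above can be sketched in a few lines with shallow regression trees as the weak learners (hyperparameters are illustrative):

```python
# Minimal gradient boosting for squared loss: each tree fits the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)

learning_rate = 0.1
pred = np.zeros_like(y)                 # start from a zero prediction
history = []
for _ in range(100):
    residuals = y - pred                         # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)      # add the scaled correction
    history.append(float(np.mean((y - pred) ** 2)))
print(f"MSE after 1 tree: {history[0]:.3f}   after 100 trees: {history[-1]:.3f}")
```

Each round shrinks the remaining error; the learning rate keeps any single tree from dominating, which is why boosting typically uses hundreds of small steps.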
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise (faster) | Symmetric trees |
| Training speed | Fast | Fastest | Slower |
| Categorical features | Manual encoding needed | Built-in support | Best-in-class |
| Best for | Competitions, medium data | Large datasets | High cardinality cats |
XGBoost and LightGBM dominate on tabular data. For every structured-data problem, try these before any neural network. They win Kaggle competitions because they handle mixed types and missing values natively and are extremely sample-efficient.
Topic 05
Support Vector Machines
SVM finds the maximum-margin hyperplane that separates two classes — the widest "road" between them. The data points closest to the boundary are called support vectors.
The kernel trick is SVMs' superpower: it implicitly maps data to a higher-dimensional space where a linear boundary exists. Common kernels: RBF (Gaussian), polynomial, linear.
Kernel SVMs scale between O(n²) and O(n³) in training-set size, which makes them impractical beyond roughly 100k samples. For large datasets, use logistic regression or gradient boosting instead.
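The kernel trick is easiest to see on data with no linear boundary. A sketch on concentric circles, comparing linear and RBF kernels (dataset parameters are illustrative):

```python
# A linear kernel fails on concentric circles; RBF separates them easily.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(f"linear kernel: {linear_acc:.3f}   RBF kernel: {rbf_acc:.3f}")
```

The linear kernel hovers near chance because no straight line separates the rings; the RBF kernel implicitly lifts the data into a space where one hyperplane does.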
Topic 06
Clustering: K-Means, DBSCAN & Hierarchical
Centroid attraction
K-Means works like this: place k centroids randomly, assign each point to its nearest centroid, move centroids to the mean of their assigned points, repeat until convergence. The intuition: points cluster around their nearest "center of gravity."
- Choose k: Elbow method (plot inertia vs k, pick the "elbow"), silhouette score
- Limitation: Assumes spherical clusters of equal size. Fails on elongated or irregular shapes
- DBSCAN: Finds clusters of arbitrary shape based on density. Better for noise/outliers
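The spherical-cluster limitation shows up clearly on two interleaving half-moons; a sketch with scikit-learn (eps and min_samples are illustrative choices):

```python
# K-Means assumes round blobs; DBSCAN recovers the elongated moon shapes.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("k-means ARI:", round(adjusted_rand_score(y, km), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y, db), 3))
```

The adjusted Rand index compares the found clusters to the true moons: K-Means slices the moons with a straight boundary, while density-based DBSCAN follows their shape.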
Topic 07
Feature Engineering: Turning Raw Data into Signal
Feature engineering is the art of creating new input variables from raw data that make patterns easier for the model to find. This is often where the biggest performance gains come from — better features beat better algorithms.
| Raw Feature | Engineered Features | Why it helps |
|---|---|---|
| timestamp | hour, day_of_week, is_weekend, days_since_last_purchase | Captures temporal patterns |
| user_id + item_id | user_item_interaction_count, item_popularity_rank | Captures behavioral signals |
| latitude + longitude | distance_to_city_center, neighborhood_cluster | Captures spatial patterns |
| text review | sentiment_score, word_count, exclamation_count | Extracts semantic features |
| price + quantity | total_value, price_per_unit, discount_percentage | Captures ratios and interactions |
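The timestamp row of the table, sketched with pandas (column names match the table; the toy timestamps are made up):

```python
# Derive temporal features from a raw timestamp column.
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-01-05 09:30", "2024-01-06 22:15", "2024-01-08 14:00"])})
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek       # Monday = 0
df["is_weekend"] = df["day_of_week"] >= 5
df["days_since_last_purchase"] = df["timestamp"].diff().dt.days
print(df)
```

A model sees `is_weekend` as a single binary signal, whereas recovering the same pattern from a raw timestamp would take many splits or weights.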
Topic 08
Model Evaluation: Cross-Validation & Metrics
Never evaluate on training data
If you evaluate a model on the same data it trained on, it will look perfect — because it's memorized it. Always hold out test data. Use k-fold cross-validation for small datasets: split data into k folds, train on k-1, evaluate on 1, rotate, average.
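A sketch of 5-fold cross-validation with scikit-learn, using its built-in breast-cancer dataset:

```python
# Each fold is held out once; the mean score estimates generalization.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("fold scores:", scores.round(3), " mean:", round(scores.mean(), 3))
```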
| Metric | Use when | Formula |
|---|---|---|
| Accuracy | Balanced classes, all errors equal | (TP+TN)/(TP+TN+FP+FN) |
| Precision | False positives are costly (spam filter) | TP/(TP+FP) |
| Recall | False negatives are costly (cancer detection) | TP/(TP+FN) |
| F1 Score | Imbalanced classes, balanced precision/recall | 2×P×R/(P+R) |
| AUC-ROC | Ranking quality, threshold-independent | Area under ROC curve |
| RMSE | Regression, penalizes large errors more | √(Σ(y-ŷ)²/n) |
Q: Your dataset has 1% fraud, 99% legit. Accuracy is 99%. Is the model good?
A: No — a model that predicts "legit" for everything achieves 99% accuracy. Use precision, recall, F1, or AUC-ROC on the minority class. Always look at the confusion matrix first.
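The trap in the answer above takes a few lines to reproduce:

```python
# A classifier that always predicts "legit" on 1%-fraud data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)    # 10 fraud cases in 1000
y_pred = np.zeros_like(y_true)             # always predict the majority class
print("accuracy:", accuracy_score(y_true, y_pred))        # 0.99
print("fraud recall:", recall_score(y_true, y_pred))      # 0.0
```

99% accuracy, yet zero fraud cases caught: exactly why recall on the minority class (or the full confusion matrix) has to be inspected first.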
Choosing an algorithm
- Tabular data, interpretability needed → Logistic Regression or Decision Tree
- Tabular data, best accuracy → XGBoost / LightGBM
- High-dimensional text/image features → SVM with RBF kernel (small data) or neural net (large)
- Finding natural groups in data → K-Means (spherical clusters) or DBSCAN (arbitrary shapes)
- Reducing dimensionality → PCA (linear) or UMAP/t-SNE (non-linear, for visualization)