Contents
Classical ML algorithms still dominate production systems for tabular data, which accounts for the large majority of real-world ML work. Data scientists at big tech companies and fintechs reach for these algorithms daily, and nearly every ML interview starts here.
Topic 01
Bias-Variance Tradeoff
The archer analogy
Imagine four archers shooting at a target:
- High bias, low variance: Arrows cluster tightly, but far from the bullseye — consistently wrong. (Underfitting)
- Low bias, high variance: Arrows spread widely but are centered on the bullseye, right on average, unreliable shot to shot. (Overfitting)
- High bias, high variance: The worst — spread out AND off-center
- Low bias, low variance: Tight cluster at the bullseye — what we want
The archer analogy: bias = systematic error, variance = inconsistency
- High bias → underfitting: Model is too simple (e.g., a linear model on non-linear data). Fix: increase model complexity, add features
- High variance → overfitting: Model memorizes training data. Fix: regularization, more data, early stopping, dropout
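The tradeoff is easy to see empirically. A minimal sketch with scikit-learn and NumPy (assumed available), fitting polynomials of increasing degree to a noisy sine; the degrees and noise level are illustrative choices:

```python
# Compare under- and overfitting by varying model complexity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 200)  # noisy non-linear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

errors = {}
for degree in (1, 4, 15):  # too simple, about right, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                      mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree {degree:2d}  train MSE {errors[degree][0]:.3f}  test MSE {errors[degree][1]:.3f}")
```

Degree 1 is bad on both sets (bias); degree 15 drives training error far below degree 1 while gaining little or nothing on held-out data (variance).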
Topic 02
Linear & Logistic Regression
Linear Regression: fitting a line
Linear regression predicts a continuous value by finding the best-fit line through your data. "Best fit" means minimizing the sum of squared errors, equivalently the mean squared error (MSE).
- Assumes: Linear relationship, no multicollinearity, homoscedastic residuals
- Solution: Closed-form (Normal Equation) for small datasets, gradient descent for large
- Regularization: L2 (Ridge) penalizes large weights, L1 (Lasso) can zero out features
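The difference between the two penalties shows up directly in the coefficients. A sketch with scikit-learn on synthetic data where only 2 of 10 features matter (the alpha values are illustrative):

```python
# L2 shrinks all weights; L1 can drive irrelevant weights exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)  # features 2..9 are pure noise

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Ridge keeps all ten coefficients small but non-zero; Lasso zeroes out the noise features, acting as built-in feature selection.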
Logistic Regression: why sigmoid?
For binary classification, we need an output between 0 and 1 (a probability). The sigmoid function σ(z) = 1 / (1 + e⁻ᶻ) squashes any real number into (0, 1).
Logistic regression predicts log-odds, not probability directly. "Log-odds of 0" → 50% probability. "Log-odds of 2" → 88%. Despite the name, it's a classification algorithm, not regression. The decision boundary is where σ(z) = 0.5, i.e., z = 0.
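The log-odds figures above are just the sigmoid evaluated at z; a few lines verify them:

```python
# Map log-odds z to a probability with the sigmoid.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5, the decision boundary
print(round(sigmoid(2), 2))  # 0.88
```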
Topic 03
Decision Trees & Random Forests
How decision trees make splits
A decision tree asks a sequence of yes/no questions to classify a data point. Each split is chosen to maximize information gain (how much uncertainty is reduced by the split).
- Gini Impurity: Measures how often a randomly chosen element would be incorrectly classified. Gini = 1 - Σpᵢ²
- Information Gain: Entropy before split - weighted entropy after split
- Stopping criteria: Max depth, min samples per leaf, min information gain
A fully-grown decision tree memorizes the training data perfectly (it can have one leaf per training sample). Always use max_depth or min_samples_leaf to regularize. A tree with depth 2-4 is often more generalizable than depth 20.
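The memorization claim can be checked directly. A sketch with scikit-learn on synthetic data (dataset parameters are arbitrary):

```python
# An unrestricted tree memorizes training data; a shallow one generalizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    depth:", deep.get_depth(), " train:", deep.score(X_tr, y_tr),
      " test:", round(deep.score(X_te, y_te), 3))
print("shallow depth:", shallow.get_depth(), " train:", round(shallow.score(X_tr, y_tr), 3),
      " test:", round(shallow.score(X_te, y_te), 3))
```

The unrestricted tree hits 100% training accuracy (pure memorization) while the depth-3 tree trades a little training fit for a smaller train/test gap.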
Random Forest: wisdom of crowds
Random forests build many decision trees and average their predictions. Each tree is trained on a random subset of data (bagging) and a random subset of features at each split. The insight: many imperfect trees that disagree with each other make better predictions than one perfect tree that overfits.
A single decision tree might achieve 85% accuracy. A random forest of 500 trees often gets to 93-96% on the same dataset — because errors made by individual trees cancel out when averaged. This is ensemble learning.
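One way to see the effect is to compare cross-validated accuracy of a single tree against a forest on the same data; the exact gap depends on the dataset, so this synthetic setup is only illustrative:

```python
# Averaging many decorrelated trees beats one unconstrained tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                             X, y, cv=5).mean()
print(f"single tree: {tree_acc:.3f}  random forest: {forest_acc:.3f}")
```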
Topic 04
Gradient Boosting: XGBoost & LightGBM
Boosting = fixing your mistakes
Imagine you're studying for an exam. After each practice test, you identify the questions you got wrong and focus extra time on those. Boosting does the same: each new tree focuses on the residual errors of the previous ensemble.
- Train a simple tree → make predictions → compute residuals (errors)
- Train next tree to predict the residuals
- Add this tree's predictions (scaled by learning rate) to the ensemble
- Repeat 100-1000 times
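The four steps above can be sketched in a few lines with shallow regression trees as the weak learners (hyperparameters are illustrative):

```python
# Minimal gradient boosting for squared loss: each tree fits the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)

learning_rate = 0.1
pred = np.zeros_like(y)                 # start from a zero prediction
history = []
for _ in range(100):
    residuals = y - pred                         # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)      # add the scaled correction
    history.append(float(np.mean((y - pred) ** 2)))
print(f"MSE after 1 tree: {history[0]:.3f}   after 100 trees: {history[-1]:.3f}")
```

Each round shrinks the remaining error; the learning rate keeps any single tree from dominating, which is why boosting typically uses hundreds of small steps.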
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise (faster) | Symmetric trees |
| Training speed | Fast | Fastest | Slower |
| Categorical features | Manual encoding needed | Built-in support | Best-in-class |
| Best for | Competitions, medium data | Large datasets | High cardinality cats |
XGBoost and LightGBM dominate on tabular data. For every structured-data problem, try these before any neural network. They win Kaggle competitions because they handle mixed types and missing values natively and are extremely sample-efficient.
Topic 05
Support Vector Machines
SVM finds the maximum-margin hyperplane that separates two classes — the widest "road" between them. The data points closest to the boundary are called support vectors.
The kernel trick is SVMs' superpower: it implicitly maps data to a higher-dimensional space where a linear boundary exists. Common kernels: RBF (Gaussian), polynomial, linear.
Kernel SVMs scale between O(n²) and O(n³) in training-set size, which makes them impractical beyond roughly 100k samples. For large datasets, use logistic regression or gradient boosting instead.
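The kernel trick is easiest to see on data with no linear boundary. A sketch on concentric circles, comparing linear and RBF kernels (dataset parameters are illustrative):

```python
# A linear kernel fails on concentric circles; RBF separates them easily.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(f"linear kernel: {linear_acc:.3f}   RBF kernel: {rbf_acc:.3f}")
```

The linear kernel hovers near chance because no straight line separates the rings; the RBF kernel implicitly lifts the data into a space where one hyperplane does.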
Topic 06
Clustering: K-Means, DBSCAN & Hierarchical
Centroid attraction
K-Means works like this: place k centroids randomly, assign each point to its nearest centroid, move centroids to the mean of their assigned points, repeat until convergence. The intuition: points cluster around their nearest "center of gravity."
- Choose k: Elbow method (plot inertia vs k, pick the "elbow"), silhouette score
- Limitation: Assumes spherical clusters of equal size. Fails on elongated or irregular shapes
- DBSCAN: Finds clusters of arbitrary shape based on density. Better for noise/outliers
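The spherical-cluster limitation shows up clearly on two interleaving half-moons; a sketch with scikit-learn (eps and min_samples are illustrative choices):

```python
# K-Means assumes round blobs; DBSCAN recovers the elongated moon shapes.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("k-means ARI:", round(adjusted_rand_score(y, km), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y, db), 3))
```

The adjusted Rand index compares the found clusters to the true moons: K-Means slices the moons with a straight boundary, while density-based DBSCAN follows their shape.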
Topic 07
Feature Engineering: Turning Raw Data into Signal
Feature engineering is the art of creating new input variables from raw data that make patterns easier for the model to find. This is often where the biggest performance gains come from — better features beat better algorithms.
| Raw Feature | Engineered Features | Why it helps |
|---|---|---|
| timestamp | hour, day_of_week, is_weekend, days_since_last_purchase | Captures temporal patterns |
| user_id + item_id | user_item_interaction_count, item_popularity_rank | Captures behavioral signals |
| latitude + longitude | distance_to_city_center, neighborhood_cluster | Captures spatial patterns |
| text review | sentiment_score, word_count, exclamation_count | Extracts semantic features |
| price + quantity | total_value, price_per_unit, discount_percentage | Captures ratios and interactions |
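The timestamp row of the table, sketched with pandas (column names match the table; the toy timestamps are made up):

```python
# Derive temporal features from a raw timestamp column.
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-01-05 09:30", "2024-01-06 22:15", "2024-01-08 14:00"])})
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek       # Monday = 0
df["is_weekend"] = df["day_of_week"] >= 5
df["days_since_last_purchase"] = df["timestamp"].diff().dt.days
print(df)
```

A model sees `is_weekend` as a single binary signal, whereas recovering the same pattern from a raw timestamp would take many splits or weights.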
Topic 08
Model Evaluation: Cross-Validation & Metrics
Never evaluate on training data
If you evaluate a model on the same data it trained on, it will look perfect — because it's memorized it. Always hold out test data. Use k-fold cross-validation for small datasets: split data into k folds, train on k-1, evaluate on 1, rotate, average.
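A sketch of 5-fold cross-validation with scikit-learn, using its built-in breast-cancer dataset:

```python
# Each fold is held out once; the mean score estimates generalization.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("fold scores:", scores.round(3), " mean:", round(scores.mean(), 3))
```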
| Metric | Use when | Formula |
|---|---|---|
| Accuracy | Balanced classes, all errors equal | (TP+TN)/(TP+TN+FP+FN) |
| Precision | False positives are costly (spam filter) | TP/(TP+FP) |
| Recall | False negatives are costly (cancer detection) | TP/(TP+FN) |
| F1 Score | Imbalanced classes, balanced precision/recall | 2×P×R/(P+R) |
| AUC-ROC | Ranking quality, threshold-independent | Area under ROC curve |
| RMSE | Regression, penalizes large errors more | √(Σ(y-ŷ)²/n) |
Q: Your dataset has 1% fraud, 99% legit. Accuracy is 99%. Is the model good?
A: No — a model that predicts "legit" for everything achieves 99% accuracy. Use precision, recall, F1, or AUC-ROC on the minority class. Always look at the confusion matrix first.
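The trap in the answer above takes a few lines to reproduce:

```python
# A classifier that always predicts "legit" on 1%-fraud data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)    # 10 fraud cases in 1000
y_pred = np.zeros_like(y_true)             # always predict the majority class
print("accuracy:", accuracy_score(y_true, y_pred))        # 0.99
print("fraud recall:", recall_score(y_true, y_pred))      # 0.0
```

99% accuracy, yet zero fraud cases caught: exactly why recall on the minority class (or the full confusion matrix) has to be inspected first.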
Choosing an algorithm
- Tabular data, interpretability needed → Logistic Regression or Decision Tree
- Tabular data, best accuracy → XGBoost / LightGBM
- High-dimensional text/image features → SVM with RBF kernel (small data) or neural net (large)
- Finding natural groups in data → K-Means (spherical clusters) or DBSCAN (arbitrary shapes)
- Reducing dimensionality → PCA (linear) or UMAP/t-SNE (non-linear, for visualization)