
Classical Machine Learning

Regression · Trees · Boosting · SVM · Clustering · Evaluation

Core Module · 3 Weeks · 10 Lessons · Prepflix AI Roadmap
ML Fundamentals: The Big Picture

Bias-Variance Tradeoff

Error = Bias² + Variance + Irreducible Noise
  • High Bias (Underfitting): simple model; high train and test error
  • High Variance (Overfitting): complex model; low train error, high test error
  • Fix high bias: add features, use a more complex model, reduce regularization
  • Fix high variance: more data, stronger regularization, simpler model, dropout
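The tradeoff is easy to see by fitting polynomials of different degree to the same noisy data (a toy sketch; the degrees, seed, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy sine data: 30 training points, 100 test points
x_train = rng.uniform(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 3, 100)
y_test = np.sin(x_test) + rng.normal(0, 0.2, 100)

def fit_eval(degree):
    """Fit a polynomial of the given degree, return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

# Degree 1 underfits (high bias): both errors stay high.
# Degree 15 typically overfits (high variance): train error collapses
# while test error does not follow it down.
for d in (1, 15):
    train_mse, test_mse = fit_eval(d)
    print(f"degree={d:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```

Degree 1 shows the high-bias signature; degree 15 shows the high-variance one.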

Regularization

L1 (Lasso): Loss + λΣ|wᵢ| → sparse weights
L2 (Ridge): Loss + λΣwᵢ² → small weights
ElasticNet: Loss + λ[α·Σ|wᵢ| + (1−α)·Σwᵢ²]
L1 vs L2: L1 produces sparse models (some weights = 0, good for feature selection). L2 shrinks all weights towards zero. Use ElasticNet when features are correlated.
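The sparsity difference is easy to demonstrate by fitting both penalties on data with known irrelevant features (a sketch using scikit-learn; the alpha value and data shape are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 5 informative features + 15 pure-noise features
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ true_w + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# L1 drives many noise-feature weights exactly to zero;
# L2 only shrinks them toward zero without eliminating them.
print("lasso zero weights:", np.sum(lasso.coef_ == 0))
print("ridge zero weights:", np.sum(ridge.coef_ == 0))
```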
Linear & Logistic Regression
Regression

Linear Regression

ŷ = Xw + b
Loss = (1/n)‖y − Xw‖²
Closed form: w = (XᵀX)⁻¹Xᵀy

Assumptions: linear relationship, homoscedasticity, no multicollinearity, normally distributed residuals. Check with residual plots.
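The closed form is a one-liner in NumPy (toy data; in practice prefer np.linalg.lstsq or pinv over explicitly inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(0, 0.1, 100)

# Append a column of ones so the bias is learned as an extra weight
Xb = np.column_stack([X, np.ones(len(X))])

# Normal equations: w = (XᵀX)⁻¹Xᵀy. pinv is used here because inverting
# XᵀX directly is numerically fragile under multicollinearity.
w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
print(w)  # ≈ [2.0, -1.0, 0.5, 3.0]
```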

Classification

Logistic Regression

p = σ(Xw + b) = 1/(1+e^−(Xw+b))
Loss = −(1/n)Σ[y·log(p) + (1−y)·log(1−p)]

Output is a probability. The default threshold is 0.5, but tune it for imbalanced data: sweep the ROC or precision-recall curve to choose the operating threshold (ROC-AUC itself is threshold-independent).

Decision boundary is linear in feature space (use polynomial features for non-linear).
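A minimal from-scratch sketch of minimizing this loss by gradient descent (toy blobs; the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # p = σ(Xw + b)
    grad_w = X.T @ (p - y) / len(y)      # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Threshold the predicted probabilities at the default 0.5
preds = (1 / (1 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
print("train accuracy:", np.mean(preds == y))  # near 1.0 on these blobs
```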
Decision Trees & Random Forests

Decision Tree Splitting

Gini: 1 − Σpᵢ² (classification)
Entropy: −Σpᵢ·log₂(pᵢ) (classification)
MSE: (1/n)Σ(yᵢ−ȳ)² (regression)

Pick split that maximizes information gain (entropy reduction) or minimizes weighted Gini.

Trees are non-parametric and handle non-linear relationships natively. But single trees overfit badly — that's why we ensemble them.
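Both impurity measures are a few lines of NumPy (a sketch):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - Σ pᵢ²."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -Σ pᵢ·log₂(pᵢ)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([1, 1, 1, 1])    # one class: zero impurity
mixed = np.array([0, 0, 1, 1])   # 50/50 split: maximum impurity
print(gini(pure), gini(mixed))   # → 0.0 0.5
print(entropy(pure), entropy(mixed))
```

A split's information gain is the parent's entropy minus the size-weighted entropy of its children.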

Random Forest Magic

  • Bagging: Each tree trained on a bootstrap sample (~63% unique rows)
  • Feature subsampling: Each split considers √p features (reduces correlation between trees)
  • Aggregation: Average predictions (regression) or majority vote (classification)
  • OOB error: Free validation using the ~37% out-of-bag samples
  • Feature importance: Mean decrease in impurity across all trees
RF is almost always a strong baseline. Try it before XGBoost. Tune: n_estimators=200+, max_features='sqrt', min_samples_leaf=5.
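The bullets above map directly onto scikit-learn's RandomForestClassifier (a sketch; the dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# oob_score=True gives "free" validation from the ~37% of rows
# each tree never saw in its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            min_samples_leaf=5, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
# Mean decrease in impurity, averaged across all trees
print("most important feature index:", np.argmax(rf.feature_importances_))
```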
Gradient Boosting: XGBoost & LightGBM

How Gradient Boosting Works

  1. Start with a constant prediction (mean of y)
  2. Compute residuals (negative gradient of loss)
  3. Fit a shallow tree to the residuals
  4. Add this tree × learning rate to model
  5. Repeat steps 2-4 for N iterations
F_m(x) = F_{m-1}(x) + α·h_m(x)

Each tree corrects errors of the previous ensemble.
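The five steps above can be sketched directly, using a shallow sklearn DecisionTreeRegressor as the weak learner; with squared-error loss the negative gradient is simply the residual (toy data, illustrative hyperparameters):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

lr, trees = 0.1, []
F = np.full(300, y.mean())              # step 1: constant prediction
for _ in range(100):                    # step 5: repeat
    residuals = y - F                   # step 2: negative gradient of ½(y−F)²
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step 3
    F += lr * tree.predict(X)           # step 4: F_m = F_{m-1} + α·h_m
    trees.append(tree)

print("final train MSE:", np.mean((y - F) ** 2))
```

To predict on new data, sum y.mean() plus lr times each stored tree's prediction.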

XGBoost vs LightGBM

  • XGBoost: level-wise tree growth, more regularization options
  • LightGBM: leaf-wise growth, often ~10× faster on large data
  • CatBoost: best for categorical features, no preprocessing needed
  • Key params: learning_rate, n_estimators, max_depth, subsample, reg_alpha/reg_lambda
Always use early stopping: early_stopping_rounds=50. Tune learning rate last — lower LR + more trees usually wins.
Support Vector Machines

Core Idea

Maximize margin: 2/‖w‖ subject to yᵢ(w·xᵢ + b) ≥ 1

Find the hyperplane that separates classes with maximum margin. Only the support vectors (points on the margin) matter.

  • Hard margin: perfect separation, sensitive to outliers
  • Soft margin (C): allows misclassification; large C = less regularization
  • Kernel trick: map to high-dimensional space without explicit computation

Kernels

  • Linear: K(x,x') = xᵀx' (fast, good for high-dimensional data)
  • RBF/Gaussian: K(x,x') = exp(−γ‖x−x'‖²) (most common default)
  • Polynomial: K(x,x') = (γxᵀx' + r)^d
SVMs scale poorly to large datasets (training is O(n²)-O(n³) in the number of samples). Use them for small/medium datasets (< 100K samples) where margin maximization matters (text classification, bioinformatics).
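The kernel trick in action, via scikit-learn's SVC on data a linear boundary cannot separate (a sketch; the dataset, C, and gamma are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The moons are not linearly separable; the RBF kernel's implicit
# high-dimensional mapping lets the margin bend around them
print("linear train accuracy:", linear.score(X, y))
print("rbf train accuracy:   ", rbf.score(X, y))
print("support vectors per class:", rbf.n_support_)
```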
Clustering Algorithms
K-Means
  • Assign points to nearest centroid
  • Update centroids to cluster means
  • Repeat until convergence
  • Sensitive to outliers, spherical clusters
  • Choose k: Elbow method or Silhouette score
Inertia = Σ‖xᵢ − μ_cluster‖²
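The loop above in plain NumPy (a sketch on toy blobs; the init strategy and iteration cap are arbitrary, and production code should handle empty clusters and use multiple restarts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated blobs
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random init from data
for _ in range(50):
    # Assignment step: nearest centroid for each point
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to its cluster mean
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break                                          # converged
    centroids = new_centroids

# Final assignments and inertia = Σ‖xᵢ − μ_cluster‖²
labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
inertia = np.sum((X - centroids[labels]) ** 2)
print("inertia:", inertia)
```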
DBSCAN
  • Density-based: no k needed
  • Finds arbitrary-shaped clusters
  • Marks outliers as noise (label = -1)
  • Params: eps (radius), min_samples
  • Great for geospatial data, anomaly detection
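A sketch with scikit-learn's DBSCAN (eps and min_samples here are arbitrary and would normally be tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])   # two far-away outliers

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 means "noise": the injected outliers end up there
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(db.labels_ == -1))
```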
Hierarchical
  • Agglomerative (bottom-up) or Divisive
  • Produces dendrogram — cut at any level
  • No need to specify k in advance
  • Linkage: single, complete, average, ward
  • O(n²) — slow for large data
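A dendrogram-cutting sketch with SciPy, using Ward linkage as in the list above (toy blobs; the cluster counts tried are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# Agglomerative clustering; Z encodes the full dendrogram
Z = linkage(X, method="ward")

# "Cut" the same dendrogram at different levels -- no k fixed in advance
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters -> sizes:", np.bincount(labels)[1:])
```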
Model Evaluation: The Full Toolkit

Classification Metrics

Precision = TP/(TP+FP) ← when FP is costly
Recall = TP/(TP+FN) ← when FN is costly
F1 = 2·(P·R)/(P+R) ← harmonic mean
ROC-AUC: ranking ability across thresholds
PR-AUC: better for imbalanced datasets
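All three metrics from raw confusion-matrix counts (a sketch):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 from TP/FP/FN counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
# tp=3, fp=1, fn=1 here, so all three metrics come out to 0.75
print(classification_metrics(y_true, y_pred))
```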

Regression Metrics

MAE = (1/n)Σ|yᵢ−ŷᵢ| ← robust to outliers
RMSE = √((1/n)Σ(yᵢ−ŷᵢ)²) ← penalizes outliers
R² = 1 − SS_res/SS_tot ← 1 is perfect
MAPE = (1/n)Σ|yᵢ−ŷᵢ|/|yᵢ| × 100% ← undefined when any yᵢ = 0
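And the regression metrics as direct translations of the formulas (a sketch):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R² from the definitions above."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return float(mae), float(rmse), float(r2)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
# Note RMSE > MAE: squaring weights the larger errors more heavily
print(regression_metrics(y_true, y_pred))
```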

Cross-Validation Strategy

k-Fold CV

Split into k folds, train on k-1, validate on 1. Repeat k times. Standard choice: k=5.

Stratified k-Fold

Preserve class distribution in each fold. Always use for classification with imbalanced classes.

Time Series Split

Never shuffle! Train on past, validate on future. Use TimeSeriesSplit from sklearn.

Data Leakage: Fitting a scaler or encoder on the entire dataset before splitting leaks test information into training. Always fit preprocessing ONLY on the train fold — use sklearn Pipelines to enforce this.
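The leakage-safe pattern in full (a sketch; the dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler lives INSIDE the pipeline, so cross_val_score re-fits it
# on each training fold only: no test-fold statistics leak in
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("fold accuracies:", scores.round(3))
```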