
Classical Machine Learning

Regression · Trees · Boosting · SVM · Clustering · Evaluation

Core Module · 3 Weeks · 10 Lessons · Prepflix AI Roadmap
ML Fundamentals: The Big Picture

Bias-Variance Tradeoff

Error = Bias² + Variance + Irreducible Noise
  • High Bias (Underfitting): simple model; high train and test error
  • High Variance (Overfitting): complex model; low train error, high test error
  • Fix high bias: add features, use a more complex model, reduce regularization
  • Fix high variance: more data, stronger regularization, simpler model, dropout
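The tradeoff is easy to see by fitting polynomials of different degree to the same noisy data (a toy sketch; the degrees, seed, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy sine data: 30 training points, 100 test points
x_train = rng.uniform(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 3, 100)
y_test = np.sin(x_test) + rng.normal(0, 0.2, 100)

def fit_eval(degree):
    """Fit a polynomial of the given degree, return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

# Degree 1 underfits (high bias): both errors stay high.
# Degree 15 typically overfits (high variance): train error collapses
# while test error does not follow it down.
for d in (1, 15):
    train_mse, test_mse = fit_eval(d)
    print(f"degree={d:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```

Degree 1 shows the high-bias signature; degree 15 shows the high-variance one.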

Regularization

L1 (Lasso): Loss + λΣ|wᵢ| → sparse weights
L2 (Ridge): Loss + λΣwᵢ² → small weights
ElasticNet: Loss + λ[α·Σ|wᵢ| + (1−α)·Σwᵢ²]
L1 vs L2: L1 produces sparse models (some weights = 0, good for feature selection). L2 shrinks all weights towards zero. Use ElasticNet when features are correlated.
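The sparsity difference is easy to demonstrate by fitting both penalties on data with known irrelevant features (a sketch using scikit-learn; the alpha value and data shape are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 5 informative features + 15 pure-noise features
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ true_w + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# L1 drives many noise-feature weights exactly to zero;
# L2 only shrinks them toward zero without eliminating them.
print("lasso zero weights:", np.sum(lasso.coef_ == 0))
print("ridge zero weights:", np.sum(ridge.coef_ == 0))
```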
Linear & Logistic Regression
Regression

Linear Regression

ŷ = Xw + b
Loss = (1/n)‖y − Xw‖²
Closed form: w = (XᵀX)⁻¹Xᵀy

Assumptions: linear relationship, homoscedasticity, no multicollinearity, normally distributed residuals. Check with residual plots.
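The closed form is a one-liner in NumPy (toy data; in practice prefer np.linalg.lstsq or pinv over explicitly inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(0, 0.1, 100)

# Append a column of ones so the bias is learned as an extra weight
Xb = np.column_stack([X, np.ones(len(X))])

# Normal equations: w = (XᵀX)⁻¹Xᵀy. pinv is used here because inverting
# XᵀX directly is numerically fragile under multicollinearity.
w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
print(w)  # ≈ [2.0, -1.0, 0.5, 3.0]
```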

Classification

Logistic Regression

p = σ(Xw + b) = 1/(1+e^−(Xw+b))
Loss = −(1/n)Σ[y·log(p) + (1−y)·log(1−p)]

Output is a probability. The default threshold is 0.5, but tune it for imbalanced data: sweep the ROC or precision-recall curve to choose the operating threshold (ROC-AUC itself is threshold-independent).

Decision boundary is linear in feature space (use polynomial features for non-linear).
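A minimal from-scratch sketch of minimizing this loss by gradient descent (toy blobs; the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # p = σ(Xw + b)
    grad_w = X.T @ (p - y) / len(y)      # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Threshold the predicted probabilities at the default 0.5
preds = (1 / (1 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
print("train accuracy:", np.mean(preds == y))  # near 1.0 on these blobs
```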
Decision Trees & Random Forests

Decision Tree Splitting

Gini: 1 − Σpᵢ² (classification)
Entropy: −Σpᵢ·log₂(pᵢ) (classification)
MSE: (1/n)Σ(yᵢ−ȳ)² (regression)

Pick split that maximizes information gain (entropy reduction) or minimizes weighted Gini.

Trees are non-parametric and handle non-linear relationships natively. But single trees overfit badly — that's why we ensemble them.
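Both impurity measures are a few lines of NumPy (a sketch):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - Σ pᵢ²."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -Σ pᵢ·log₂(pᵢ)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([1, 1, 1, 1])    # one class: zero impurity
mixed = np.array([0, 0, 1, 1])   # 50/50 split: maximum impurity
print(gini(pure), gini(mixed))   # → 0.0 0.5
print(entropy(pure), entropy(mixed))
```

A split's information gain is the parent's entropy minus the size-weighted entropy of its children.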

Random Forest Magic

  • Bagging: Each tree trained on a bootstrap sample (~63% unique rows)
  • Feature subsampling: Each split considers √p features (reduces correlation between trees)
  • Aggregation: Average predictions (regression) or majority vote (classification)
  • OOB error: Free validation using the ~37% out-of-bag samples
  • Feature importance: Mean decrease in impurity across all trees
RF is almost always a strong baseline. Try it before XGBoost. Tune: n_estimators=200+, max_features='sqrt', min_samples_leaf=5.
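The bullets above map directly onto scikit-learn's RandomForestClassifier (a sketch; the dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# oob_score=True gives "free" validation from the ~37% of rows
# each tree never saw in its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            min_samples_leaf=5, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
# Mean decrease in impurity, averaged across all trees
print("most important feature index:", np.argmax(rf.feature_importances_))
```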
Gradient Boosting: XGBoost & LightGBM

How Gradient Boosting Works

  1. Start with a constant prediction (mean of y)
  2. Compute residuals (negative gradient of loss)
  3. Fit a shallow tree to the residuals
  4. Add this tree × learning rate to model
  5. Repeat steps 2-4 for N iterations
F_m(x) = F_{m-1}(x) + α·h_m(x)

Each tree corrects errors of the previous ensemble.
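The five steps above can be sketched directly, using a shallow sklearn DecisionTreeRegressor as the weak learner; with squared-error loss the negative gradient is simply the residual (toy data, illustrative hyperparameters):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

lr, trees = 0.1, []
F = np.full(300, y.mean())              # step 1: constant prediction
for _ in range(100):                    # step 5: repeat
    residuals = y - F                   # step 2: negative gradient of ½(y−F)²
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step 3
    F += lr * tree.predict(X)           # step 4: F_m = F_{m-1} + α·h_m
    trees.append(tree)

print("final train MSE:", np.mean((y - F) ** 2))
```

To predict on new data, sum y.mean() plus lr times each stored tree's prediction.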

XGBoost vs LightGBM

  • XGBoost: level-wise tree growth, more regularization options
  • LightGBM: leaf-wise growth, often ~10× faster on large data
  • CatBoost: best for categorical features, no preprocessing needed
  • Key params: learning_rate, n_estimators, max_depth, subsample, reg_alpha/reg_lambda
Always use early stopping: early_stopping_rounds=50. Tune learning rate last — lower LR + more trees usually wins.
Support Vector Machines

Core Idea

Maximize margin: 2/‖w‖ subject to yᵢ(w·xᵢ + b) ≥ 1

Find the hyperplane that separates classes with maximum margin. Only the support vectors (points on the margin) matter.

  • Hard margin: perfect separation, sensitive to outliers
  • Soft margin (C): allows misclassification; large C = less regularization
  • Kernel trick: map to high-dimensional space without explicit computation

Kernels

  • Linear: K(x,x') = xᵀx' (fast, good for high-dimensional data)
  • RBF/Gaussian: K(x,x') = exp(−γ‖x−x'‖²) (most common default)
  • Polynomial: K(x,x') = (γxᵀx' + r)^d
SVMs scale poorly to large datasets (training is O(n²)-O(n³) in the number of samples). Use them for small/medium datasets (< 100K samples) where margin maximization matters (text classification, bioinformatics).
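The kernel trick in action, via scikit-learn's SVC on data a linear boundary cannot separate (a sketch; the dataset, C, and gamma are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The moons are not linearly separable; the RBF kernel's implicit
# high-dimensional mapping lets the margin bend around them
print("linear train accuracy:", linear.score(X, y))
print("rbf train accuracy:   ", rbf.score(X, y))
print("support vectors per class:", rbf.n_support_)
```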
Clustering Algorithms
K-Means
  • Assign points to nearest centroid
  • Update centroids to cluster means
  • Repeat until convergence
  • Sensitive to outliers, spherical clusters
  • Choose k: Elbow method or Silhouette score
Inertia = Σ‖xᵢ − μ_cluster‖²
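The loop above in plain NumPy (a sketch on toy blobs; the init strategy and iteration cap are arbitrary, and production code should handle empty clusters and use multiple restarts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated blobs
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random init from data
for _ in range(50):
    # Assignment step: nearest centroid for each point
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to its cluster mean
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break                                          # converged
    centroids = new_centroids

# Final assignments and inertia = Σ‖xᵢ − μ_cluster‖²
labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
inertia = np.sum((X - centroids[labels]) ** 2)
print("inertia:", inertia)
```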
DBSCAN
  • Density-based: no k needed
  • Finds arbitrary-shaped clusters
  • Marks outliers as noise (label = -1)
  • Params: eps (radius), min_samples
  • Great for geospatial data, anomaly detection
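A sketch with scikit-learn's DBSCAN (eps and min_samples here are arbitrary and would normally be tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])   # two far-away outliers

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 means "noise": the injected outliers end up there
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(db.labels_ == -1))
```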
Hierarchical
  • Agglomerative (bottom-up) or Divisive
  • Produces dendrogram — cut at any level
  • No need to specify k in advance
  • Linkage: single, complete, average, ward
  • O(n²) — slow for large data
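A dendrogram-cutting sketch with SciPy, using Ward linkage as in the list above (toy blobs; the cluster counts tried are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# Agglomerative clustering; Z encodes the full dendrogram
Z = linkage(X, method="ward")

# "Cut" the same dendrogram at different levels -- no k fixed in advance
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters -> sizes:", np.bincount(labels)[1:])
```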
Model Evaluation: The Full Toolkit

Classification Metrics

Precision = TP/(TP+FP) ← when FP is costly
Recall = TP/(TP+FN) ← when FN is costly
F1 = 2·(P·R)/(P+R) ← harmonic mean
ROC-AUC: ranking ability across thresholds
PR-AUC: better for imbalanced datasets
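All three metrics from raw confusion-matrix counts (a sketch):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 from TP/FP/FN counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
# tp=3, fp=1, fn=1 here, so all three metrics come out to 0.75
print(classification_metrics(y_true, y_pred))
```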

Regression Metrics

MAE = (1/n)Σ|yᵢ−ŷᵢ| ← robust to outliers
RMSE = √((1/n)Σ(yᵢ−ŷᵢ)²) ← penalizes outliers
R² = 1 − SS_res/SS_tot ← 1 is perfect
MAPE = (1/n)Σ|yᵢ−ŷᵢ|/|yᵢ| × 100% ← undefined when any yᵢ = 0
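And the regression metrics as direct translations of the formulas (a sketch):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R² from the definitions above."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return float(mae), float(rmse), float(r2)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
# Note RMSE > MAE: squaring weights the larger errors more heavily
print(regression_metrics(y_true, y_pred))
```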

Cross-Validation Strategy

k-Fold CV

Split into k folds, train on k-1, validate on 1. Repeat k times. Standard choice: k=5.

Stratified k-Fold

Preserve class distribution in each fold. Always use for classification with imbalanced classes.

Time Series Split

Never shuffle! Train on past, validate on future. Use TimeSeriesSplit from sklearn.

Data Leakage: Fitting a scaler or encoder on the entire dataset before splitting leaks test information into training. Always fit preprocessing ONLY on the train fold — use sklearn Pipelines to enforce this.
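The leakage-safe pattern in full (a sketch; the dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler lives INSIDE the pipeline, so cross_val_score re-fits it
# on each training fold only: no test-fold statistics leak in
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("fold accuracies:", scores.round(3))
```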