Topic 01

Why MLOps Exists

You've trained a model with 94% accuracy on your laptop. Congratulations. Now: how do you deploy it? How do you know if it degrades after deployment? How do you retrain it when new data arrives? How do you roll back if the new model performs worse? These are MLOps problems.

Traditional software: code goes through git → CI/CD → production. ML: code + data + model all need versioning, testing, and monitoring. A model that was accurate in January can silently fail in June if user behavior changes.

Topic 02

Experiment Tracking: MLflow & W&B

Have you ever trained 30 model variants and forgotten which hyperparameters gave the best result? Experiment tracking tools log every run automatically.

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import roc_auc_score, precision_score

# model, X_train, y_train, X_test, y_test are defined elsewhere
mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="xgboost-v2"):
    # Log hyperparameters
    mlflow.log_params({"n_estimators": 200, "max_depth": 6, "lr": 0.1})

    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({
        "auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        "precision": precision_score(y_test, model.predict(X_test)),
    })

    # Save model artifact to the registry
    mlflow.sklearn.log_model(model, "model", registered_model_name="FraudDetector")
```
| Feature | MLflow | Weights & Biases |
|---|---|---|
| Hosting | Self-hosted (great for enterprises) | Cloud SaaS |
| Cost | Free (infrastructure costs only) | Free tier, paid plans |
| Dashboards | Basic | Rich, interactive, beautiful |
| Model registry | Built-in | Built-in |
| Best for | Regulated industries, on-prem | Research teams, startups |

Topic 03

Model Serving: FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="Fraud Detection API", version="1.0")

# Load the model once at import time, not on every request!
model = joblib.load("model.pkl")

class PredictRequest(BaseModel):
    amount: float
    hour_of_day: int
    is_international: bool

class PredictResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    # Plain def (not async): FastAPI runs sync endpoints in a threadpool,
    # so CPU-bound inference doesn't block the event loop
    features = np.array([[req.amount, req.hour_of_day, int(req.is_international)]])
    prob = model.predict_proba(features)[0, 1]
    return PredictResponse(fraud_probability=float(prob), is_fraud=bool(prob > 0.5))

@app.get("/health")
async def health():
    return {"status": "ok"}

# Run: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```
FastAPI best practices
  • Load the model once at startup: at module level or in a lifespan handler (@app.on_event("startup") still works but is deprecated in newer FastAPI versions)
  • Use Pydantic models for request/response — auto-validation + auto-docs
  • Add /health and /metrics endpoints for load balancer checks
  • Use async endpoints for I/O-bound work; use plain def for CPU-bound work like model inference, since FastAPI runs sync endpoints in a threadpool instead of on the event loop
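Assuming the service above is running locally on port 8000, a request against /predict might look like this (field values are illustrative):

```shell
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 1250.0, "hour_of_day": 3, "is_international": true}'
# Response shape: {"fraud_probability": <float in [0, 1]>, "is_fraud": <bool>}
```

Because the endpoint uses Pydantic models, a malformed request (e.g., a string for amount) is rejected with a 422 error before your code runs.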

Topic 04

Docker for ML

Docker packages your model + dependencies + code into a reproducible container. "It works on my machine" is no longer an excuse.

```dockerfile
# Dockerfile for ML API
FROM python:3.11-slim AS base
WORKDIR /app

# Install dependencies first (Docker caches this layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app code
COPY . .

# Run as non-root user for security
RUN useradd --create-home appuser
USER appuser

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
Always use .dockerignore

Exclude from the Docker build context: __pycache__, .git, *.pyc, data/, *.csv, large model files, and .env secrets. Without it, Docker sends your entire project directory (potentially gigabytes) to the build daemon, slowing every build; worse, COPY . . can bake secrets and raw data into image layers.
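A minimal .dockerignore matching the list above (adjust entries for your project layout; whether to exclude model files depends on whether the image bakes the model in or fetches it at startup):

```
__pycache__/
*.pyc
.git/
.env
data/
*.csv
.venv/
# model artifacts: exclude only if the container downloads them at startup
# *.pkl
```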

Topic 05

CI/CD Pipelines for ML

```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/ -v --cov=src
      - name: Train model
        run: python train.py --output models/model.pkl
      - name: Evaluate model (GATE!)
        run: python evaluate.py --min-auc 0.85
      - name: Build Docker image
        run: docker build -t fraud-api:${{ github.sha }} .
      - name: Deploy to production
        run: |
          aws ecs update-service --cluster prod \
            --service fraud-api --force-new-deployment
```
The Model Gate — most important step

The evaluate.py --min-auc 0.85 step is crucial. It blocks deployment if the new model performs below the threshold. Without this, a bad commit that breaks feature preprocessing would silently deploy a worse model to production. Always gate on a minimum performance threshold.
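The core of such a gate is tiny: compare the metric to the threshold and exit non-zero on failure. A minimal sketch (how evaluate.py obtains the AUC, e.g., from a metrics file written by train.py, is an assumption):

```python
import sys

def passes_gate(auc: float, min_auc: float) -> bool:
    """True when the candidate model clears the minimum threshold."""
    return auc >= min_auc

def gate_or_exit(auc: float, min_auc: float) -> None:
    # CI fails the job on any non-zero exit code, which blocks the deploy steps
    if not passes_gate(auc, min_auc):
        print(f"MODEL GATE FAILED: AUC {auc:.3f} < required {min_auc:.2f}")
        sys.exit(1)
    print(f"Model gate passed: AUC {auc:.3f}")

# In a real evaluate.py, the AUC would come from scoring the candidate model
# on a held-out set, e.g.:
# gate_or_exit(auc=compute_holdout_auc(), min_auc=0.85)
```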

Topic 06

Monitoring & Drift Detection

  • Data drift: Input distribution changes — fraud patterns shift seasonally, user demographics change. Detect by comparing feature distributions over time (KL divergence, PSI score)
  • Concept drift: The relationship between features and target changes — a feature that was predictive in 2024 may not be predictive in 2025
  • Prediction drift: Model outputs shift — more predictions in one class than expected
  • Performance degradation: Track model and business metrics: precision, recall, revenue impact (this requires ground-truth labels, which often arrive with a delay)
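The PSI score mentioned under data drift can be computed directly from binned feature distributions. A minimal NumPy sketch (10 bins and the 1e-6 floor are arbitrary choices, not a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training) sample
    and a recent production sample of one feature."""
    # Bin edges come from the baseline distribution (quantile bins)
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A common rule of thumb: PSI below 0.1 indicates no meaningful shift, 0.1-0.25 a moderate shift, and above 0.25 a significant shift worth investigating.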
Monitoring checklist
  • Log every prediction with timestamp, input features, output, and ground truth (when available)
  • Track the p-value of feature distribution differences (train vs. recent production), e.g., with a Kolmogorov-Smirnov test per feature
  • Set up alerts when AUC drops below threshold or prediction distribution shifts by >10%
  • Shadow mode: run new model in parallel with old before full switchover
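Shadow mode needs a comparison step. A minimal sketch, assuming both models' probabilities are logged for the same requests (the function name and the 0.5 decision threshold are illustrative):

```python
def shadow_report(live_probs, shadow_probs, threshold=0.5):
    """Compare a shadow model's outputs to the live model's on identical traffic."""
    n = len(live_probs)
    # How often the two models would have made the same fraud/not-fraud decision
    agree = sum((p > threshold) == (s > threshold)
                for p, s in zip(live_probs, shadow_probs))
    # Average absolute gap between the predicted probabilities
    mean_gap = sum(abs(p - s) for p, s in zip(live_probs, shadow_probs)) / n
    return {"decision_agreement": agree / n, "mean_prob_gap": mean_gap}
```

Low agreement on live traffic is a signal to investigate before switchover, even when the new model's offline metrics look better.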

Topic 07

Cloud ML Deployment

| Option | Best for | Key advantage |
|---|---|---|
| AWS SageMaker | Enterprises, compliance-heavy industries | Managed everything: training, serving, monitoring, Feature Store |
| GCP Vertex AI | Teams using GCP, TPU access needed | AutoML, Kubeflow Pipelines, best TPU ecosystem |
| Modal | Startups, serverless GPU inference | Pay per millisecond, zero infrastructure management |
| Hugging Face Spaces | Demos, open-source models | Free tier, GPU-backed, perfect for model demos |
| Railway / Render | Small APIs, side projects | Deploy Docker containers in minutes, cheap |
SageMaker spot instances — 60-70% savings

SageMaker training jobs on spot instances can save 60-70% on GPU costs. The risk: spot instances can be interrupted. Mitigate by enabling checkpointing every N steps — if interrupted, resume from last checkpoint. For training jobs longer than 1 hour, this is almost always worth it.
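Checkpointing itself is framework-specific, but the pattern is simple: persist the step and state atomically, and resume from the checkpoint if one exists. A minimal sketch (on SageMaker, checkpoints written under /opt/ml/checkpoints are synced to S3; the JSON file format here is an illustrative simplification):

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # on SageMaker this would live under /opt/ml/checkpoints

def save_checkpoint(step, state, path=CKPT_PATH):
    """Write the checkpoint atomically so an interruption mid-write can't corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename

def load_checkpoint(path=CKPT_PATH):
    """Return (step, state), or (0, None) when starting fresh."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

# The training loop resumes wherever the last checkpoint left off:
# start_step, state = load_checkpoint()
# for step in range(start_step, total_steps):
#     state = train_one_step(state)
#     if step % 100 == 0:
#         save_checkpoint(step, state)
```

With this in place, a spot interruption costs at most the work done since the last checkpoint, not the whole run.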