
Computer Vision

OpenCV · CNN Architectures · Object Detection · Segmentation · GANs

Specialization · 2 Weeks · 5 Lessons · Prepflix AI Roadmap
Image Processing & OpenCV

Essential OpenCV Operations

import cv2
import numpy as np

img = cv2.imread('photo.jpg')                 # BGR, not RGB!
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Resize
resized = cv2.resize(img, (224, 224))         # ImageNet standard

# Filters
blurred = cv2.GaussianBlur(img, (5, 5), 0)
edges = cv2.Canny(gray, 50, 150)

# Normalization (for DL models)
img_norm = img.astype(np.float32) / 255.0
mean = [0.485, 0.456, 0.406]                  # ImageNet mean
std = [0.229, 0.224, 0.225]                   # ImageNet std

Data Augmentation (torchvision)

from torchvision import transforms

train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),          # ImageNet mean/std from above
])

val_tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
Always augment training data but never validation/test data (except resize/crop/normalize).
CNN Architectures
ResNet-50
  • 50 layers with residual connections
  • Bottleneck blocks: 1×1, 3×3, 1×1
  • Input: 224×224×3
  • Top-1 ImageNet: ~76%
  • Best for: feature extraction, transfer learning baseline
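The residual bottleneck pattern above can be sketched as follows — a minimal PyTorch module (channel counts are illustrative; this is the identity-shortcut case, without the strided downsampling variant):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a ResNet bottleneck block: 1x1 reduce, 3x3, 1x1 expand,
    plus the identity shortcut (assumes input/output channels match)."""
    def __init__(self, channels, mid):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # residual connection: output = F(x) + x

x = torch.randn(1, 256, 56, 56)
y = Bottleneck(256, 64)(x)
print(y.shape)  # torch.Size([1, 256, 56, 56]) — shape preserved by the shortcut
```

The shortcut is what lets gradients flow through 50 layers without vanishing: each block only has to learn a residual correction F(x), not the full mapping.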
EfficientNetB0-B7
  • Compound scaling: depth+width+resolution
  • B0: 5.3M params, B7: 66M params
  • State-of-the-art accuracy/efficiency tradeoff
  • Best for: production models with size constraints
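Compound scaling picks one coefficient φ and grows depth, width, and resolution together. A quick sketch of the arithmetic, using the base coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15):

```python
# EfficientNet compound scaling: depth, width, resolution all scale with phi
alpha, beta, gamma = 1.2, 1.1, 1.15   # base coefficients from the EfficientNet paper

def scale(phi):
    """Approximate multipliers for EfficientNet-B{phi} relative to B0."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale(3)  # e.g. roughly B3
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# The paper's constraint: alpha * beta^2 * gamma^2 ~ 2,
# so each +1 in phi roughly doubles FLOPs
print(alpha * beta**2 * gamma**2)  # ~1.92
```

This is why B0→B7 is a smooth family rather than ad-hoc variants: one knob (φ) trades compute for accuracy in a principled way.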
ViT (Vision Transformer)
  • Split image into 16×16 patches
  • Flatten + positional embedding
  • Standard Transformer encoder
  • Needs large data (or pretraining)
  • Best for: large datasets, fine-grained recognition
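The patch-splitting step above can be sketched with plain tensor ops (shapes are for a standard 224×224 input with 16×16 patches):

```python
import torch

def patchify(img, patch=16):
    """Split a (B, C, H, W) image into flattened patches: (B, N, patch*patch*C)."""
    B, C, H, W = img.shape
    p = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return p

x = torch.randn(2, 3, 224, 224)
patches = patchify(x)
print(patches.shape)  # (2, 196, 768): 14x14 = 196 patches, each 16*16*3 = 768 dims
```

Each 768-dim patch vector is then linearly projected to the model dimension and summed with a learned positional embedding before entering the Transformer encoder.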

Transfer Learning Recipe

import torchvision.models as models
import torch.nn as nn

model = models.resnet50(pretrained=True)  # newer torchvision: weights='IMAGENET1K_V2'

# Freeze backbone
for param in model.parameters():
    param.requires_grad = False

# Replace head with your task's classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Phase 1: train head only (5-10 epochs)
# Phase 2: unfreeze last block, train with LR/10
for param in model.layer4.parameters():
    param.requires_grad = True
Object Detection

YOLO (You Only Look Once)

  • Single-stage detector — one forward pass
  • Grid divides image → each cell predicts boxes + class
  • YOLOv8/v10/v11: state-of-the-art speed-accuracy tradeoff
  • Best for: real-time applications
from ultralytics import YOLO

model = YOLO('yolov8n.pt')                # nano variant
results = model.predict('image.jpg')      # inference
model.train(data='coco.yaml', epochs=50)  # fine-tune on a dataset

Key Metrics

IoU = Intersection Area / Union Area
IoU ≥ 0.5 · common true-positive threshold (PASCAL VOC standard; COCO averages over 0.5–0.95)
mAP@0.5 · mean AP across classes at IoU = 0.5
mAP@0.5:0.95 · COCO metric: AP averaged over IoU thresholds from 0.5 to 0.95
NMS · non-max suppression: removes overlapping duplicate boxes
Anchor-free · modern YOLO versions drop anchor boxes for a simpler pipeline
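The IoU formula above is short enough to implement directly; a sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes, (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # clamp: no overlap -> 0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # union = sum of areas - intersection

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

NMS uses exactly this function: boxes are sorted by confidence, and any box with IoU above a threshold against an already-kept box is suppressed.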
GANs & Diffusion Models

GAN Framework

min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]
  • Generator G: maps noise z → fake image
  • Discriminator D: real vs fake classifier
  • Training: alternate D steps and G steps
  • Mode collapse, training instability are key challenges
  • Metrics: FID (Fréchet Inception Distance), IS (Inception Score)
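The alternating training loop can be sketched on toy data — a minimal PyTorch version with MLP networks (all sizes, learning rates, and the stand-in "real" distribution are illustrative):

```python
import torch
import torch.nn as nn

# Minimal GAN training sketch on toy 2-D data
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise z -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(5):
    real = torch.randn(64, 2) + 3.0   # stand-in "real" distribution
    z = torch.randn(64, 8)

    # D step: push D(real) -> 1 and D(G(z)) -> 0; detach so G gets no gradient here
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G step: push D(G(z)) -> 1 (non-saturating generator loss)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(d_loss), float(g_loss))
```

Note the G step uses the non-saturating loss (maximize log D(G(z))) rather than the literal minimax term, since the latter gives vanishing gradients early in training.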
DCGAN · Conv-based GAN, stable training
StyleGAN2/3 · SOTA face synthesis, style control
Pix2Pix, CycleGAN · Image-to-image translation

Diffusion Models (DDPM)

  1. Forward process: Gradually add Gaussian noise to image over T steps
  2. Reverse process: Train UNet to predict noise at each step
  3. Generation: Start from pure noise, denoise T times
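Step 1 has a convenient closed form — you can jump to any timestep t directly: x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1 − β_t). A sketch using the DDPM paper's linear β schedule (the toy image shapes are illustrative):

```python
import torch

# DDPM forward (noising) process in closed form
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (DDPM paper)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def q_sample(x0, t, eps):
    """Sample x_t given x_0 in one shot, no need to iterate t times."""
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

x0 = torch.randn(4, 3, 8, 8)                     # toy "image" batch
eps = torch.randn_like(x0)
x_T = q_sample(x0, T - 1, eps)                   # at t = T-1, almost pure noise
print(float(alpha_bar[-1]))                      # ~4e-5: original signal nearly gone
```

The UNet is trained to predict ε from (x_t, t); generation then runs the learned reverse process from pure noise.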
Stable Diffusion is a Latent Diffusion Model (LDM): it runs diffusion in a compressed latent space (8× smaller per spatial dimension), making training and inference much faster, and adds a CLIP text encoder for text conditioning.
DDPM · 1000 steps, slow but high quality
DDIM · 20–50 steps, deterministic, faster