
Computer Vision

OpenCV · CNN Architectures · Object Detection · Segmentation · GANs

Specialization · 2 Weeks · 5 Lessons · Prepflix AI Roadmap
Image Processing & OpenCV

Essential OpenCV Operations

import cv2
import numpy as np

img = cv2.imread('photo.jpg')                 # BGR, not RGB!
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Resize
resized = cv2.resize(img, (224, 224))         # ImageNet standard

# Filters
blurred = cv2.GaussianBlur(img, (5, 5), 0)
edges = cv2.Canny(gray, 50, 150)

# Normalization (for DL models)
img_norm = img.astype(np.float32) / 255.0
mean = [0.485, 0.456, 0.406]                  # ImageNet mean
std = [0.229, 0.224, 0.225]                   # ImageNet std

Data Augmentation (torchvision)

from torchvision import transforms

train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),          # ImageNet mean/std from above
])

val_tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
Always augment training data but never validation/test data (except resize/crop/normalize).
CNN Architectures
ResNet-50
  • 50 layers with residual connections
  • Bottleneck blocks: 1×1, 3×3, 1×1
  • Input: 224×224×3
  • Top-1 ImageNet: ~76%
  • Best for: feature extraction, transfer learning baseline
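The residual bottleneck pattern above can be sketched as follows — a minimal PyTorch module (channel counts are illustrative; this is the identity-shortcut case, without the strided downsampling variant):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a ResNet bottleneck block: 1x1 reduce, 3x3, 1x1 expand,
    plus the identity shortcut (assumes input/output channels match)."""
    def __init__(self, channels, mid):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # residual connection: output = F(x) + x

x = torch.randn(1, 256, 56, 56)
y = Bottleneck(256, 64)(x)
print(y.shape)  # torch.Size([1, 256, 56, 56]) — shape preserved by the shortcut
```

The shortcut is what lets gradients flow through 50 layers without vanishing: each block only has to learn a residual correction F(x), not the full mapping.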
EfficientNetB0-B7
  • Compound scaling: depth+width+resolution
  • B0: 5.3M params, B7: 66M params
  • State-of-the-art accuracy/efficiency tradeoff
  • Best for: production models with size constraints
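Compound scaling picks one coefficient φ and grows depth, width, and resolution together. A quick sketch of the arithmetic, using the base coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15):

```python
# EfficientNet compound scaling: depth, width, resolution all scale with phi
alpha, beta, gamma = 1.2, 1.1, 1.15   # base coefficients from the EfficientNet paper

def scale(phi):
    """Approximate multipliers for EfficientNet-B{phi} relative to B0."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale(3)  # e.g. roughly B3
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# The paper's constraint: alpha * beta^2 * gamma^2 ~ 2,
# so each +1 in phi roughly doubles FLOPs
print(alpha * beta**2 * gamma**2)  # ~1.92
```

This is why B0→B7 is a smooth family rather than ad-hoc variants: one knob (φ) trades compute for accuracy in a principled way.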
ViT (Vision Transformer)
  • Split image into 16×16 patches
  • Flatten + positional embedding
  • Standard Transformer encoder
  • Needs large data (or pretraining)
  • Best for: large datasets, fine-grained recognition
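The patch-splitting step above can be sketched with plain tensor ops (shapes are for a standard 224×224 input with 16×16 patches):

```python
import torch

def patchify(img, patch=16):
    """Split a (B, C, H, W) image into flattened patches: (B, N, patch*patch*C)."""
    B, C, H, W = img.shape
    p = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return p

x = torch.randn(2, 3, 224, 224)
patches = patchify(x)
print(patches.shape)  # (2, 196, 768): 14x14 = 196 patches, each 16*16*3 = 768 dims
```

Each 768-dim patch vector is then linearly projected to the model dimension and summed with a learned positional embedding before entering the Transformer encoder.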

Transfer Learning Recipe

import torchvision.models as models
import torch.nn as nn

model = models.resnet50(pretrained=True)  # newer torchvision: weights='IMAGENET1K_V2'

# Freeze backbone
for param in model.parameters():
    param.requires_grad = False

# Replace head with your task's classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Phase 1: train head only (5-10 epochs)
# Phase 2: unfreeze last block, train with LR/10
for param in model.layer4.parameters():
    param.requires_grad = True
Object Detection

YOLO (You Only Look Once)

  • Single-stage detector — one forward pass
  • Grid divides image → each cell predicts boxes + class
  • YOLOv8/v10/v11: state-of-the-art speed-accuracy tradeoff
  • Best for: real-time applications
from ultralytics import YOLO

model = YOLO('yolov8n.pt')                # nano variant
results = model.predict('image.jpg')      # inference
model.train(data='coco.yaml', epochs=50)  # fine-tune on a dataset

Key Metrics

IoU = Intersection Area / Union Area
IoU ≥ 0.5 · common true-positive threshold (PASCAL VOC standard; COCO averages over 0.5–0.95)
mAP@0.5 · mean AP across classes at IoU = 0.5
mAP@0.5:0.95 · COCO metric: AP averaged over IoU thresholds from 0.5 to 0.95
NMS · non-max suppression: removes overlapping duplicate boxes
Anchor-free · modern YOLO versions drop anchor boxes for a simpler pipeline
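The IoU formula above is short enough to implement directly; a sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes, (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # clamp: no overlap -> 0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # union = sum of areas - intersection

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

NMS uses exactly this function: boxes are sorted by confidence, and any box with IoU above a threshold against an already-kept box is suppressed.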
GANs & Diffusion Models

GAN Framework

min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]
  • Generator G: maps noise z → fake image
  • Discriminator D: real vs fake classifier
  • Training: alternate D steps and G steps
  • Mode collapse, training instability are key challenges
  • Metrics: FID (Fréchet Inception Distance), IS (Inception Score)
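The alternating training loop can be sketched on toy data — a minimal PyTorch version with MLP networks (all sizes, learning rates, and the stand-in "real" distribution are illustrative):

```python
import torch
import torch.nn as nn

# Minimal GAN training sketch on toy 2-D data
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise z -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(5):
    real = torch.randn(64, 2) + 3.0   # stand-in "real" distribution
    z = torch.randn(64, 8)

    # D step: push D(real) -> 1 and D(G(z)) -> 0; detach so G gets no gradient here
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G step: push D(G(z)) -> 1 (non-saturating generator loss)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(d_loss), float(g_loss))
```

Note the G step uses the non-saturating loss (maximize log D(G(z))) rather than the literal minimax term, since the latter gives vanishing gradients early in training.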
DCGAN · Conv-based GAN, stable training
StyleGAN2/3 · SOTA face synthesis, style control
Pix2Pix, CycleGAN · Image-to-image translation

Diffusion Models (DDPM)

  1. Forward process: Gradually add Gaussian noise to image over T steps
  2. Reverse process: Train UNet to predict noise at each step
  3. Generation: Start from pure noise, denoise T times
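Step 1 has a convenient closed form — you can jump to any timestep t directly: x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1 − β_t). A sketch using the DDPM paper's linear β schedule (the toy image shapes are illustrative):

```python
import torch

# DDPM forward (noising) process in closed form
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (DDPM paper)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def q_sample(x0, t, eps):
    """Sample x_t given x_0 in one shot, no need to iterate t times."""
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

x0 = torch.randn(4, 3, 8, 8)                     # toy "image" batch
eps = torch.randn_like(x0)
x_T = q_sample(x0, T - 1, eps)                   # at t = T-1, almost pure noise
print(float(alpha_bar[-1]))                      # ~4e-5: original signal nearly gone
```

The UNet is trained to predict ε from (x_t, t); generation then runs the learned reverse process from pure noise.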
Stable Diffusion is a Latent Diffusion Model (LDM): it runs diffusion in a compressed latent space (8× smaller per spatial dimension), making training and inference much faster, and adds a CLIP text encoder for text conditioning.
DDPM · 1000 steps, slow but high quality
DDIM · 20–50 steps, deterministic, faster