Topic 01

What Is an Image to a Computer?

A color image is a 3D array of shape (Height, Width, Channels). A 640×480 RGB image has 640×480×3 = 921,600 numbers. Each pixel has three values: Red, Green, Blue — each between 0 (dark) and 255 (bright).


RGB image = 3 channel matrices stacked — each pixel is (R, G, B) triple

Key Insight

Before feeding images to neural networks, normalize pixel values to [0, 1] or [-1, 1]. Unnormalized inputs in [0, 255] produce large activations and gradients, which destabilizes training. For models pretrained on ImageNet, also subtract the per-channel dataset mean and divide by the per-channel standard deviation (ImageNet normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).
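A minimal NumPy sketch of this two-step pipeline, using a randomly generated array as a stand-in image (shapes are illustrative):

```python
import numpy as np

# Toy "image": random uint8 values in an H x W x 3 array (assumed RGB order)
img = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Step 1: scale to [0, 1]
x = img.astype(np.float32) / 255.0

# Step 2: per-channel ImageNet normalization
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
x_norm = (x - mean) / std   # broadcasts over the channel axis

print(img.shape)   # (480, 640, 3)
```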

Topic 02

OpenCV Fundamentals

OpenCV is the industry-standard library for image processing — used in production at Google, Meta, and every robotics company. Essential operations:

```python
import cv2
import numpy as np

# Read and display
img = cv2.imread('image.jpg')                    # BGR format (not RGB!)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Resize
resized = cv2.resize(img, (224, 224))

# Edge detection (Canny)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# Gaussian blur (noise reduction)
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=0)

# Drawing (for visualization)
x1, y1, x2, y2 = 50, 40, 200, 180                # example box coordinates
cv2.rectangle(img, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
cv2.putText(img, "Dog 0.94", (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
```
OpenCV gotcha

OpenCV reads images as BGR (Blue-Green-Red), not RGB. If you feed an OpenCV image directly to a model trained on RGB data (PyTorch, torchvision), the red and blue channels are swapped and predictions degrade. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) before preprocessing.
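Under the hood the conversion is just a channel swap. A NumPy-only illustration (the two-pixel "image" is contrived for clarity):

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel, one pure-red pixel
bgr = np.array([[[255, 0, 0],      # blue in BGR
                 [0, 0, 255]]],    # red in BGR
               dtype=np.uint8)

# Reversing the last axis swaps B and R, the same effect as cv2.COLOR_BGR2RGB
rgb = bgr[:, :, ::-1]

print(rgb[0, 0])  # blue pixel, now (0, 0, 255) in RGB order
print(rgb[0, 1])  # red pixel, now (255, 0, 0) in RGB order
```

Note that slicing with `::-1` returns a reversed view; libraries that expect contiguous memory may need `np.ascontiguousarray` on the result. When OpenCV is available, `cv2.cvtColor` is the idiomatic route.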

Topic 03

The Convolution Operation

A convolution filter (kernel) is a small matrix (e.g., 3×3) that slides across the image. At each position, it computes the dot product of the filter and the image patch beneath it. This produces a feature map.

Figure: a 3×3 Sobel edge filter, [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], slides over a 5×5 input; each position yields one value of the feature map.

Convolution: sliding a 3×3 filter over the image, computing dot products → feature map

Each filter detects a specific pattern. Early layers in a trained CNN have filters that look like edge detectors (similar to the Sobel filter above). Deeper layers have filters that respond to complex textures, then parts of objects, then whole objects.
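To make the sliding-window arithmetic concrete, here is a naive NumPy implementation (deep-learning "convolution" is technically cross-correlation, i.e., the kernel is not flipped; the 5×5 input with a vertical edge is made up for illustration):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out

# Vertical-edge Sobel filter from the figure
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# 5x5 input with a sharp vertical edge down the middle
img = np.array([[0, 0, 0, 9, 9]] * 5, dtype=np.float32)

fmap = conv2d(img, sobel_x)
print(fmap.shape)   # (3, 3): a 3x3 feature map from a 5x5 input and a 3x3 filter
```

The feature map responds strongly (value 36) at the positions straddling the edge and is zero in the flat region, which is exactly what "this filter detects vertical edges" means.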

Topic 04

CNN Architectures: ResNet, EfficientNet & ViT

| Architecture | Key Innovation | Best for | ImageNet Top-1 |
|---|---|---|---|
| ResNet-50 | Skip connections (highway bypasses) | General vision, fine-tuning baseline | 76% |
| EfficientNet-B4 | Compound scaling (width + depth + resolution) | Best accuracy/compute tradeoff | 83% |
| ViT-Base | Treat image patches as tokens, use transformers | Large-scale datasets, SOTA accuracy | 81% |
| ConvNeXt | Modernized ResNet with transformer design choices | Production; matches ViT with fewer resources | 82% |
RESNET

Skip Connections: the highway bypass

Very deep networks (100+ layers) fail to train — gradients vanish before reaching early layers. ResNet's solution: add a shortcut connection that bypasses layers. If a layer learns nothing useful, its output is just the input passed through unchanged. This "identity shortcut" lets gradients flow directly through deep networks.
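A toy NumPy sketch of the idea (real residual blocks use convolutions and batch normalization; plain matrix multiplies stand in here):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

def residual_block(x, W1, W2):
    """y = F(x) + x: the skip connection adds the input back to the block output."""
    return relu(W2 @ relu(W1 @ x)) + x

x = np.array([1.0, -2.0, 3.0])

# If the block learns nothing useful (all-zero weights), F(x) = 0
# and the output is exactly the input: the block degrades to identity.
W1 = np.zeros((3, 3))
W2 = np.zeros((3, 3))
print(residual_block(x, W1, W2))   # identical to x
```

This is why depth stops hurting: a useless layer costs nothing, and gradients always have the identity path to flow back through.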

Topic 05

Object Detection: YOLO & Faster R-CNN

Object detection = find every object in an image AND say what it is. Output: bounding boxes [x, y, w, h] + class labels + confidence scores.

| Approach | Speed | Accuracy | How it works |
|---|---|---|---|
| YOLO (v5/v8) | Real-time (45+ FPS) | Good | Single forward pass: divides the image into a grid; each cell predicts boxes |
| Faster R-CNN | Slower (5-7 FPS) | Better | Two stages: propose regions, then classify each region |
| SSD | Fast (30+ FPS) | Medium | Single pass; multi-scale feature maps for different object sizes |
| DETR | Medium | Excellent | Transformer-based end-to-end detection; no anchor boxes needed |
When to use YOLO

YOLO is the default choice for real-time applications: autonomous vehicles, security cameras, sports tracking. For medical imaging, or whenever false positives are very costly, use Faster R-CNN; the extra latency is worth the accuracy. YOLOv8 (by Ultralytics) is production-ready and usable in just a few lines of Python.

Topic 06

Image Segmentation

  • Semantic segmentation: Classify every pixel with a class label (road/car/person/sky). All cars are one class — can't distinguish individual cars
  • Instance segmentation: Detect AND segment each individual object. Car #1 vs Car #2 are different instances. Mask R-CNN is the standard approach
  • Panoptic segmentation: Combines both — semantic for background (road, sky) + instance for foreground objects
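The three outputs differ mainly in mask format. A small NumPy sketch contrasting semantic and instance masks (the class IDs and the tiny 3×4 mask are invented for illustration):

```python
import numpy as np

# Semantic mask: one class ID per pixel (0=sky, 1=road, 2=car; hypothetical IDs)
semantic = np.array([[0, 0, 0, 0],
                     [1, 1, 1, 1],
                     [1, 2, 2, 1]])

# Pixel count per class: note all car pixels are lumped into class 2
classes, counts = np.unique(semantic, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))   # {0: 4, 1: 6, 2: 2}

# Instance segmentation instead returns one boolean mask per object,
# so car #1 and car #2 each get their own mask:
car_1 = np.zeros_like(semantic, dtype=bool)
car_1[2, 1] = True
car_2 = np.zeros_like(semantic, dtype=bool)
car_2[2, 2] = True
```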

Topic 07

GANs & Diffusion Models

GANs

Art forger vs art critic

GANs have two networks: a Generator (creates fake images) and a Discriminator (distinguishes real from fake). They train against each other: the generator gets better at fooling the discriminator, which gets better at detecting fakes. Equilibrium = photorealistic images.
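The tug of war can be written down as two opposing losses. A scalar sketch of the original GAN objective, where d_real and d_fake stand for the discriminator's probability outputs on a real and a generated image:

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator: maximize log D(real) + log(1 - D(fake))."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Generator (non-saturating form): maximize log D(fake)."""
    return -math.log(d_fake)

# As the generator's fakes get more convincing (D(fake) rises),
# the generator's loss falls and the discriminator's loss rises.
print(g_loss(0.1) > g_loss(0.9))   # True: unconvincing fakes -> high generator loss
```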

DIFFUSION

Learn to denoise → generate

Diffusion models add Gaussian noise to an image over many steps (typically ~1,000) until the image is pure noise. Then a neural network is trained to reverse this process: denoise step by step. At inference: start from pure noise and apply the denoiser repeatedly until a coherent image emerges. This is how Stable Diffusion, DALL·E 3, and Midjourney work.
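The forward (noising) process has a convenient closed form: you can jump straight to any step t without simulating the chain. A NumPy sketch with an illustrative DDPM-style linear schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule over T steps (values are illustrative DDPM defaults)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def noise_image(x0, t):
    """x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones((8, 8))            # toy "image"
x_mid = noise_image(x0, 500)    # partially noised
x_end = noise_image(x0, T - 1)  # nearly pure Gaussian noise
print(alphas_bar[-1] < 1e-4)    # True: almost no signal left by the final step
```

The denoiser is trained to predict eps from x_t and t; sampling runs this prediction in reverse, which is why inference takes many network evaluations.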

Why diffusion replaced GANs

GANs are notoriously hard to train (mode collapse, training instability). Diffusion models are more stable to train and produce better diversity. However, they're 10-100× slower at inference. Current research (consistency models, flow matching) is closing this gap.

Topic 08

Computer Vision Metrics

| Metric | Task | What it measures |
|---|---|---|
| Top-1 / Top-5 Accuracy | Classification | Is the correct label in the top 1/5 predictions? |
| IoU | Detection/Segmentation | Intersection over Union of predicted vs. ground-truth box |
| mAP | Object Detection | Mean Average Precision across all classes and IoU thresholds |
| FID | Image Generation | Fréchet Inception Distance; lower = more realistic generated images |
| SSIM | Image Quality | Structural Similarity; correlates with human perception of image quality |
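IoU is simple enough to compute by hand. A sketch in plain Python using (x1, y1, x2, y2) corner boxes (the [x, y, w, h] detector format converts via x2 = x + w, y2 = y + h):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0: perfect overlap
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.333: half-shifted boxes
```

A common mAP convention treats a detection as correct when IoU with a ground-truth box exceeds a threshold such as 0.5.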