Contents
Topic 01
What Is an Image to a Computer?
A color image is a 3D array of shape (Height, Width, Channels). A 640×480 RGB image (height 480, width 640, so array shape (480, 640, 3)) contains 640×480×3 = 921,600 numbers. Each pixel holds three values, Red, Green, and Blue, each an integer from 0 (dark) to 255 (bright).
RGB image = 3 channel matrices stacked — each pixel is (R, G, B) triple
Before feeding images to neural networks, normalize pixel values to [0, 1] or [-1, 1]. This keeps gradients well-scaled and makes training noticeably more stable. It is also standard to subtract the dataset mean and divide by the standard deviation per channel (ImageNet normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).
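A minimal numpy sketch of this preprocessing pipeline. The random image is a stand-in for real data, and the shape (480, 640, 3) assumes a 640×480 image stored in height-width-channel order:

```python
import numpy as np

# Hypothetical 640x480 RGB image: uint8 values in [0, 255], shape (H, W, C).
img = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Step 1: scale to [0, 1].
x = img.astype(np.float32) / 255.0

# Step 2: per-channel ImageNet normalization (broadcasts over H and W).
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
x = (x - mean) / std

print(x.shape)  # → (480, 640, 3)
```

PyTorch users typically get the same effect from `torchvision.transforms.Normalize`, but the arithmetic is exactly this.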
Topic 02
OpenCV Fundamentals
OpenCV is the industry-standard library for image processing, used in production at Google, Meta, and most robotics companies. Its essential operations include reading/writing images, color-space conversion, resizing, filtering, and geometric transforms.
OpenCV reads images as BGR (Blue-Green-Red), not RGB. If you feed an OpenCV image directly to a model trained on RGB data (PyTorch, torchvision), colors will be wrong. Always convert: cv2.cvtColor(img, cv2.COLOR_BGR2RGB) before preprocessing.
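A small sketch of this pitfall. To keep the snippet runnable without OpenCV installed, a numpy channel reversal stands in for the `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` call; on a plain array the two produce the same result:

```python
import numpy as np

# A tiny image as OpenCV would load it: BGR channel order.
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255  # channel 0 is Blue in BGR order

# Equivalent of cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB):
# reverse the channel axis.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # → [  0   0 255]: Blue is now channel 2, as RGB expects
```

Forgetting this conversion silently swaps red and blue, which a model trained on RGB data will interpret as wildly out-of-distribution colors.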
Topic 03
The Convolution Operation
A convolution filter (kernel) is a small matrix (e.g., 3×3) that slides across the image. At each position, it computes the dot product of the filter and the image patch beneath it (strictly speaking this is cross-correlation, since the kernel is not flipped, but deep-learning libraries call it convolution). This produces a feature map.
Convolution: sliding a 3×3 filter over the image, computing dot products → feature map
Each filter detects a specific pattern. Early layers in a trained CNN have filters that look like edge detectors (similar to classic hand-crafted Sobel filters). Deeper layers have filters that respond to complex textures, then parts of objects, then whole objects.
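The sliding dot product can be sketched in a few lines of numpy. This is a naive, unoptimized "valid" cross-correlation; the Sobel-x kernel and the step-edge test image are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution: slide the kernel, take dot products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-x kernel: responds strongly to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# A vertical step edge: dark left half, bright right half.
img = np.zeros((5, 5), dtype=np.float32)
img[:, 3:] = 1.0

fmap = conv2d(img, sobel_x)
print(fmap[0])  # → [0. 4. 4.]: strong response where the edge sits
```

Real frameworks vectorize this (im2col, FFT, or cuDNN kernels), but the arithmetic per output cell is identical.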
Topic 04
CNN Architectures: ResNet, EfficientNet & ViT
| Architecture | Key Innovation | Best for | ImageNet Top-1 |
|---|---|---|---|
| ResNet-50 | Skip connections (highway bypasses) | General vision, fine-tuning baseline | 76% |
| EfficientNet-B4 | Compound scaling (width+depth+resolution) | Best accuracy/compute tradeoff | 83% |
| ViT-Base | Treat image patches as tokens, use transformers | Large-scale datasets, SOTA accuracy | 81% |
| ConvNeXt | Modernized ResNet with transformer design choices | Production, matches ViT with fewer resources | 82% |
Skip Connections: the highway bypass
Very deep networks (100+ layers) fail to train — gradients vanish before reaching early layers. ResNet's solution: add a shortcut connection that bypasses layers. If a layer learns nothing useful, its output is just the input passed through unchanged. This "identity shortcut" lets gradients flow directly through deep networks.
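A minimal numpy sketch of the identity shortcut. A single linear+ReLU layer stands in for the block's conv layers, and the weight matrix is hypothetical:

```python
import numpy as np

def residual_block(x, weight):
    """Residual computation: output = x + F(x).
    F is a linear layer + ReLU, standing in for conv-BN-ReLU stacks."""
    fx = np.maximum(weight @ x, 0.0)  # F(x)
    return x + fx                     # identity shortcut

x = np.ones(4)
w = np.zeros((4, 4))      # a layer that has "learned nothing useful"
y = residual_block(x, w)

print(np.allclose(y, x))  # → True: the input passes through unchanged
```

Because the gradient of `x + F(x)` with respect to `x` always contains an identity term, the shortcut gives gradients a direct path to early layers.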
Topic 05
Object Detection: YOLO & Faster R-CNN
Object detection = find every object in an image AND say what it is. Output: bounding boxes [x, y, w, h] + class labels + confidence scores.
| Approach | Speed | Accuracy | How it works |
|---|---|---|---|
| YOLO (v5/v8) | Real-time (45+ FPS) | Good | Single forward pass — divides image into grid, each cell predicts boxes |
| Faster R-CNN | Slower (5-7 FPS) | Better | Two stages: propose regions, then classify each region |
| SSD | Fast (30+ FPS) | Medium | Single pass, multi-scale feature maps for different object sizes |
| DETR | Medium | Excellent | Transformer-based end-to-end detection, no anchor boxes needed |
YOLO is the default choice for real-time applications: autonomous vehicles, security cameras, sports tracking. For medical imaging or when false positives are very costly, use Faster R-CNN — the extra latency is worth the accuracy. YOLO v8 (by Ultralytics) is production-ready with 3 lines of Python.
Topic 06
Image Segmentation
- Semantic segmentation: Classify every pixel with a class label (road/car/person/sky). All cars are one class — can't distinguish individual cars
- Instance segmentation: Detect AND segment each individual object. Car #1 vs Car #2 are different instances. Mask R-CNN is the standard approach
- Panoptic segmentation: Combines both — semantic for background (road, sky) + instance for foreground objects
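The data structures behind these three tasks can be sketched directly. The class ids and the tiny "image" below are hypothetical:

```python
import numpy as np

# Semantic segmentation output: one class id per pixel.
# Hypothetical class ids: 0 = sky, 1 = road, 2 = car.
semantic = np.array([[0, 0, 0, 0],
                     [1, 1, 2, 2],
                     [1, 1, 2, 2]])

# Instance segmentation additionally separates objects of the same class:
# one boolean mask per detected object (here, a single car).
car_1_mask = (semantic == 2)

print(car_1_mask.sum())  # → 4 pixels belong to car #1
```

Panoptic output combines the two: class-id maps for "stuff" (road, sky) plus per-instance masks for "things" (cars, people).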
Topic 07
GANs & Diffusion Models
Art forger vs art critic
GANs have two networks: a Generator (creates fake images) and a Discriminator (distinguishes real from fake). They train against each other: the generator gets better at fooling the discriminator, which gets better at detecting fakes. Equilibrium = photorealistic images.
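This adversarial game is usually written as a minimax objective (standard GAN notation: generator G, discriminator D, noise vector z):

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D maximizes the objective by classifying real and fake correctly; G minimizes it by making D(G(z)) approach 1.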
Learn to denoise → generate
Diffusion models gradually add Gaussian noise to an image over many steps (1,000 in the original DDPM) until only noise remains, then train a neural network to reverse this process, denoising step by step. At inference: start from pure noise and apply the denoiser repeatedly until a coherent image emerges. This is how Stable Diffusion, DALL·E 3, and Midjourney work.
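In standard DDPM notation (noise schedule β_t, with α_t = 1 − β_t and ᾱ_t = ∏ₛ α_s), the forward noising process and its convenient closed form are:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, \mathbf{I}\bigr)
```

The closed form means any noise level x_t can be sampled directly from the clean image x_0 in one step during training, rather than simulating all t noising steps.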
GANs are notoriously hard to train (mode collapse, training instability). Diffusion models are more stable to train and produce better diversity. However, they're 10-100× slower at inference. Current research (consistency models, flow matching) is closing this gap.
Topic 08
Computer Vision Metrics
| Metric | Task | What it measures |
|---|---|---|
| Top-1 / Top-5 Accuracy | Classification | Is correct label in top 1/5 predictions? |
| IoU | Detection/Segmentation | Intersection over Union of predicted vs ground-truth box (or mask) |
| mAP | Object Detection | Mean Average Precision across all classes and IoU thresholds |
| FID | Image Generation | Fréchet Inception Distance — lower = more realistic generated images |
| SSIM | Image Quality | Structural Similarity — matches human perception of image quality |
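As an illustration, IoU for axis-aligned boxes is a few lines of Python. This sketch assumes (x1, y1, x2, y2) corner format rather than the [x, y, w, h] format mentioned under object detection:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes offset by 5 in x: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.3333333333333333
```

mAP builds on this: a prediction counts as correct only if its IoU with a ground-truth box exceeds a threshold (commonly 0.5, or averaged over 0.5–0.95 in COCO evaluation).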