
Deep Learning

Neural Networks · Backprop · PyTorch · CNNs · RNNs · Transformers

Core Module · 3 Weeks · 8 Lessons · Prepflix AI Roadmap
Neural Networks from Scratch

Forward Pass

z¹ = W¹x + b¹ (linear)
a¹ = f(z¹) (activation)
z² = W²a¹ + b²
ŷ = softmax(z²) (output)

A neural network is just a stack of linear transformations + non-linear activations. Without activations, stacking layers = one linear layer.
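The forward pass above can be sketched in a few lines of NumPy. The layer sizes (4 → 8 → 3) are arbitrary placeholders, and ReLU stands in for the activation f:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # layer 1 params
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # layer 2 params

z1 = W1 @ x + b1        # linear
a1 = np.maximum(0, z1)  # activation (ReLU)
z2 = W2 @ a1 + b2
y_hat = softmax(z2)     # output probabilities
```

Without the `np.maximum` line, `W2 @ (W1 @ x + b1) + b2` collapses to a single linear map, which is exactly the "stacking layers = one linear layer" point.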

Loss Functions

Binary Cross-Entropy: −(y·log(p) + (1−y)·log(1−p))
Categorical CE: −Σ yᵢ·log(pᵢ) (multi-class)
MSE: (1/n)·Σ(y−ŷ)² (regression)
MAE / Huber: robust to outliers
Contrastive / Triplet: embeddings, face recognition
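The first three losses are easy to write out directly; a minimal NumPy sketch (the `eps` clipping guards against log(0)):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy over 0/1 targets and predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def categorical_ce(y_onehot, p, eps=1e-12):
    """Multi-class cross-entropy over one-hot targets."""
    return float(-(y_onehot * np.log(np.clip(p, eps, None))).sum(axis=-1).mean())

def mse(y, y_hat):
    """Mean squared error for regression."""
    return float(((y - y_hat) ** 2).mean())
```

Perfect predictions drive BCE and categorical CE to zero; confident wrong predictions blow them up, which is why the clipping matters in practice.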
Backpropagation & Optimizers

Backprop in 3 Lines

∂L/∂W² = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂W²
∂L/∂W¹ = ∂L/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹

Chain rule applied backwards. PyTorch autograd does this automatically by building a computation graph during forward pass.

# PyTorch autograd example
loss.backward()        # compute all gradients
optimizer.step()       # update weights
optimizer.zero_grad()  # ALWAYS zero before next batch

Optimizer Comparison

SGD + Momentum: stable, needs LR tuning. Best for CV.
Adam: adaptive LR, fast convergence. Default choice.
AdamW: Adam with decoupled weight decay. Best for Transformers.
RMSProp: adaptive LR, good for RNNs.
Adam: m_t = β₁·m_{t-1} + (1−β₁)·g_t
v_t = β₂·v_{t-1} + (1−β₂)·g_t²
m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) (bias correction)
θ ← θ − α·m̂_t/(√v̂_t + ε)
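The Adam update maps directly to NumPy. A sketch of one step (default hyperparameters are the usual β₁=0.9, β₂=0.999 values):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = b1 * m + (1 - b1) * g        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * g**2     # EMA of squared gradients (second moment)
    m_hat = m / (1 - b1**t)          # bias correction for zero-initialized EMAs
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([1.0]), m, v, t=1)
```

Note the first step moves θ by roughly α regardless of the gradient's magnitude: after bias correction, m̂/√v̂ ≈ g/|g| at t=1. That per-coordinate normalization is the "adaptive LR" behavior.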
Activations, Loss & Regularization

Activation Functions

ReLU: max(0,x) — fast, default for hidden layers
LeakyReLU: max(0.01x, x) — fixes dying ReLU
GELU: x·Φ(x) — used in Transformers (BERT, GPT)
Sigmoid: output layer for binary classification
Softmax: output layer for multi-class
Tanh: [-1,1] range, used in RNNs/LSTMs
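These are all one-liners; a NumPy sketch (the GELU here is the tanh approximation used in BERT/GPT, not the exact x·Φ(x)):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)  # small slope keeps negative units alive

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
```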

Regularization Techniques

Dropout: zero activations randomly (p=0.1-0.5). Only active during training.
Batch Norm: normalize layer inputs per mini-batch. Reduces covariate shift.
Layer Norm: normalize across features. Used in Transformers.
Weight Decay: L2 penalty on weights via the optimizer.
Early Stopping: stop when validation loss stops improving.
BN order debate: Conv → BN → ReLU (original) vs Conv → ReLU → BN (some prefer). BN before activation is more common.
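A minimal inverted-dropout sketch makes the "only during training" point concrete, and shows why forgetting `model.eval()` is a bug:

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    if not training or p == 0.0:
        return x                         # eval mode: identity (what model.eval() switches to)
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1 - p
    return x * mask / (1 - p)            # inverted dropout: rescale so E[output] = x

x = np.ones(1000)
y = dropout(x, p=0.5, training=True)
```

The 1/(1−p) rescaling keeps the expected activation unchanged, so no correction is needed at inference time; surviving units are scaled up to 2.0 here and dropped units become 0.0.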
PyTorch: The Training Loop
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Define model
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, out_dim)
        )

    def forward(self, x):
        return self.net(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(784, 256, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

# Training loop (canonical pattern)
for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():  # disable gradient computation
        # compute val loss and accuracy
        pass
Common Bugs: Forgetting model.eval() before inference (dropout stays active!), forgetting zero_grad() (gradients accumulate!), using .detach() incorrectly. Always call these in the right order.
Convolutional Neural Networks (CNNs)

Conv Layer Output Size

Out = ⌊(In + 2P − K)/S⌋ + 1

In=input size, P=padding, K=kernel size, S=stride

Conv 3×3, s=1, p=1: same spatial size
Conv 3×3, s=2, p=1: halves spatial size
MaxPool 2×2, s=2: halves spatial size
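The formula is worth encoding once and checking against the three rules of thumb above:

```python
def conv_out(in_size, k, s=1, p=0):
    """Output spatial size: floor((in + 2p - k)/s) + 1."""
    return (in_size + 2 * p - k) // s + 1

same   = conv_out(32, k=3, s=1, p=1)  # 3x3, s=1, p=1 -> same size
halved = conv_out(32, k=3, s=2, p=1)  # 3x3, s=2, p=1 -> half size
pooled = conv_out(32, k=2, s=2, p=0)  # 2x2 max pool, s=2 -> half size
```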

Receptive Field

Each deeper layer "sees" a larger area of the input. Deeper = larger receptive field = more context.
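The growth in receptive field can be computed layer by layer. A sketch, where each layer is given as a (kernel, stride) pair and `jump` tracks how far apart adjacent outputs are in input pixels:

```python
def receptive_field(layers):
    """Receptive field of the final output, for layers listed input -> output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) * current jump
        jump *= s             # stride compounds multiplicatively
    return rf

stacked = receptive_field([(3, 1), (3, 1)])  # two 3x3 stride-1 convs
strided = receptive_field([(3, 2), (3, 1)])  # stride-2 first layer grows it faster
```

Two stacked 3×3 convs see a 5×5 patch; adding stride makes later layers cover ground much faster, which is why downsampling layers appear early in most CNNs.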

Architecture Evolution

LeNet (1998): Conv-Pool-Conv-Pool-FC
AlexNet (2012): deep CNN, ReLU, Dropout. ImageNet breakthrough.
ResNet (2015): skip connections y = F(x) + x. Enables 100+ layers.
EfficientNet: compound scaling of depth/width/resolution.
ViT (2020): patches + positional embeddings + Transformer.
Residual Connection: y = F(x) + x solves vanishing gradients — gradient flows directly through the skip connection.
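A toy residual block shows why this helps: when F(x) contributes nothing, the block is exactly the identity, so adding more blocks can never make the network strictly worse to optimize. A NumPy sketch with a two-linear-layer F:

```python
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(0, x @ W1)  # F(x): linear -> ReLU -> linear
    return x + h @ W2          # skip connection: y = F(x) + x

x = np.array([1.0, -2.0, 3.0])
# with zero weights, F(x) = 0 and the block reduces to the identity
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```

In the backward pass, ∂y/∂x = ∂F/∂x + I: the identity term gives the gradient a direct path to earlier layers, which is the "solves vanishing gradients" claim above.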
Attention Mechanism & Transformers

Scaled Dot-Product Attention

Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V
  • Q (Query): What am I looking for?
  • K (Key): What do I contain?
  • V (Value): What do I return?
  • √d_k prevents softmax saturation (vanishing gradients)
  • O(n²) memory — quadratic in sequence length
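The formula above fits in a few lines of NumPy; a single-head sketch over a length-5 sequence (sizes are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) matrix: the O(n^2) cost
    weights = softmax(scores)        # each row: a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
```

Each output row is a weighted average of the value vectors, with weights given by how well that position's query matches every key; the √d_k divisor keeps the scores from saturating the softmax as d_k grows.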

Multi-Head Attention

MultiHead(Q,K,V) = Concat(head₁, …, headₕ)·W^O
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

Different heads learn different types of relationships: syntax, coreference, semantics. h=8 or 16 typically.

Transformer Block

Input → LayerNorm → Multi-Head Attention → Residual → LayerNorm → FFN → Residual → Output

Why Transformers replaced RNNs: RNNs process tokens sequentially (slow). Transformers attend to all positions simultaneously (parallelizable). RNNs have vanishing gradients over long sequences. Transformers have direct connections to any position via attention.