PyTorch: The Training Loop
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Define model
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, out_dim)
        )

    def forward(self, x):
        return self.net(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(784, 256, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

# Placeholder data so the loop runs end to end (swap in a real dataset)
num_epochs = 50
train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,))),
    batch_size=64, shuffle=True
)

# Training loop (canonical pattern)
for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
    scheduler.step()  # once per epoch (T_max counts epochs)

    model.eval()
    with torch.no_grad():  # disable gradient computation
        # compute val loss and accuracy here
        pass
Common Bugs: Forgetting model.eval() before inference (dropout stays active, so predictions become stochastic!); forgetting optimizer.zero_grad() (gradients accumulate across batches!); calling .detach() on a tensor that still needs gradients (silently cuts the graph and breaks backprop). Keep the order: zero_grad → forward → backward → step.
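A quick sketch of why the first bug matters: nn.Dropout zeroes activations (and rescales the rest by 1/(1-p)) only in training mode, so an un-eval'd model gives different outputs on every forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()
out_train = layer(x)  # some elements zeroed, survivors scaled to 2.0

layer.eval()
out_eval = layer(x)   # identity: dropout disabled

print(out_train)
print(out_eval)       # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```

The same switch controls BatchNorm's choice between batch statistics and running statistics, which is another common source of train/eval mismatch.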
Attention Mechanism & Transformers
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V
- Q (Query): What am I looking for?
- K (Key): What do I contain?
- V (Value): What do I return?
- Dividing by √d_k keeps the dot products' variance near 1, preventing softmax saturation (vanishing gradients)
- O(n²) memory — quadratic in sequence length
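The formula above translates almost line for line into PyTorch. A minimal sketch, assuming inputs of shape (batch, seq_len, d_k) and omitting masking for brevity:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # (batch, n, n) score matrix -- this is the O(n^2) term
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights

Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 64])
```

Recent PyTorch versions also ship this as torch.nn.functional.scaled_dot_product_attention, which uses memory-efficient kernels to soften the O(n²) cost.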
Multi-Head Attention
MultiHead(Q,K,V) = Concat(head₁, …, headₕ)·W^O
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
Different heads learn different types of relationships: syntax, coreference, semantics. Typically h = 8 or 16, with per-head dimension d_k = d_model / h.
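The Concat/split bookkeeping is the only non-obvious part. A sketch of the standard pattern, assuming d_model is divisible by h (here the per-head projections Wᵢ^Q are fused into one d_model × d_model linear layer):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V):
        B, n, _ = Q.shape
        # project, then reshape to (B, h, n, d_k) so each head attends independently
        def split(x):
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_q(Q)), split(self.W_k(K)), split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v         # (B, h, n, d_k)
        concat = heads.transpose(1, 2).reshape(B, n, -1)  # Concat(head_1..head_h)
        return self.W_o(concat)                           # final W^O projection

mha = MultiHeadAttention(d_model=64, h=8)
x = torch.randn(2, 10, 64)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 64])
```

In practice nn.MultiheadAttention provides the same computation with masking and dropout built in.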
Transformer Block
Input → LayerNorm → Multi-Head Attention → Residual → LayerNorm → FFN → Residual → Output
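The diagram above is the pre-LN variant (LayerNorm before each sublayer, residual added after). A minimal sketch using nn.MultiheadAttention; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, h=8, d_ff=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        a = self.ln1(x)                                    # LayerNorm, then attention
        x = x + self.attn(a, a, a, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # LayerNorm -> FFN -> residual
        return x

block = TransformerBlock()
x = torch.randn(2, 10, 64)
print(block(x).shape)  # torch.Size([2, 10, 64])
```

Pre-LN (as here) trains more stably than the original post-LN ordering, which applied LayerNorm after each residual sum instead.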
Why Transformers replaced RNNs: RNNs process tokens sequentially, so training can't be parallelized across the sequence, and their gradients vanish over long spans. Transformers attend to all positions simultaneously (fully parallelizable on GPUs) and give every position a direct, one-hop connection to every other position via attention.