Deep learning is just neural networks with many layers. Each layer learns increasingly abstract representations — early layers detect edges, middle layers detect shapes, final layers detect concepts. The key breakthroughs: more data, more compute, and the specific architecture choices that made training stable.

Topic 01

Neurons & The Forward Pass


Biological inspiration (and why it doesn't matter too much)

A biological neuron receives signals from many other neurons, sums them up, and fires if the total exceeds a threshold. An artificial neuron does the same: multiply each input by a weight, add a bias, and apply an activation function.

output = activation(Σ wᵢxᵢ + b) = activation(W·x + b)
[Figure: a single artificial neuron. Inputs x₁, x₂, x₃ are weighted by w₁, w₂, w₃, summed with the bias b, and passed through a ReLU activation to produce ŷ.]

A single artificial neuron: weighted sum → activation → output

Stack many neurons in parallel → a layer. Stack many layers → a deep neural network. The forward pass is simply passing input data through every layer, left to right, until you get a prediction.
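A minimal NumPy sketch of this forward pass; the weights and inputs below are hand-picked, purely illustrative numbers:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """The forward pass: apply each layer's W·x + b, then the activation."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# A tiny 3 -> 2 -> 1 network, each layer a (W, b) pair
layers = [
    (np.array([[1.0, -1.0, 0.5],
               [0.0,  2.0, 1.0]]), np.array([0.1, -0.2])),
    (np.array([[1.0, 1.0]]), np.array([0.0])),
]
x = np.array([1.0, 0.5, 2.0])
print(forward(x, layers))  # [4.4]
```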

Topic 02

Backpropagation: Blame Assignment

After the forward pass, we compute the loss (how wrong the prediction was). Backpropagation figures out how much each weight contributed to that error — so we can adjust them.

Think of it as a factory that produced a defective product. The factory manager traces back: was the defect in the raw material, the assembly step, or the quality check? Each step gets a "blame score." In neural networks, this blame score is the gradient — dLoss/dWeight.

dL/dw₁ = dL/dŷ × dŷ/dz × dz/dw₁    (chain rule)
Key Insight

Backprop is just the chain rule applied recursively from output to input. Modern frameworks (PyTorch, TensorFlow) compute this automatically via autograd — you never implement it manually. But understanding it helps you debug gradient explosions/vanishing gradients.
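To make the chain rule concrete, here is the gradient for a one-weight "network" (identity activation, squared-error loss; all numbers are made up), checked against a finite-difference estimate:

```python
# z = w*x + b,  yhat = z (identity activation),  L = (yhat - y)^2
w, b, x, y = 0.5, 0.1, 2.0, 1.0

z = w * x + b
yhat = z
L = (yhat - y) ** 2

# Chain rule, term by term, exactly as in the formula above
dL_dyhat = 2 * (yhat - y)   # derivative of the squared error
dyhat_dz = 1.0              # identity activation
dz_dw = x
grad_analytic = dL_dyhat * dyhat_dz * dz_dw
print(round(grad_analytic, 10))  # 0.4

# Sanity check with a finite difference: nudge w, see how L moves
eps = 1e-6
L_nudged = ((w + eps) * x + b - y) ** 2
grad_numeric = (L_nudged - L) / eps
print(abs(grad_numeric - grad_analytic) < 1e-4)  # True
```

This nudge-and-compare check is exactly how autograd implementations are tested (`torch.autograd.gradcheck` automates it).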

Topic 03

Activation Functions & Why Non-Linearity Matters

Without activation functions, stacking layers does nothing — a sequence of linear transformations is still linear. Non-linear activations let neural networks learn complex, curved decision boundaries.

  • ReLU: max(0, x). Use: hidden layers (the default choice). Pitfall: dying ReLU (neurons stuck at 0).
  • Leaky ReLU: max(0.01x, x). Use: hidden layers when dying ReLU is a problem. Pitfall: slight computational overhead.
  • Sigmoid: 1/(1 + e⁻ˣ). Use: binary output layer. Pitfall: vanishing gradients in deep nets.
  • Tanh: (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ). Use: RNNs, when zero-centered outputs matter. Pitfall: also saturates at extremes.
  • Softmax: exp(xᵢ)/Σⱼ exp(xⱼ). Use: multi-class output layer. Pitfall: not for hidden layers.
  • GELU: x·Φ(x). Use: Transformers (BERT, GPT). Pitfall: slightly more compute than ReLU.
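A few of these written out in NumPy, including a direct look at the sigmoid pitfall: its gradient σ(x)(1 − σ(x)) peaks at 0.25 and collapses toward zero once inputs saturate.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Sigmoid gradient: at most 0.25, nearly zero when |x| is large
g_center = sigmoid(0.0) * (1 - sigmoid(0.0))
g_saturated = sigmoid(10.0) * (1 - sigmoid(10.0))
print(g_center)     # 0.25
print(g_saturated)  # ~4.5e-05: the vanishing-gradient pitfall

# Softmax turns arbitrary scores into a valid probability distribution
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0
```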

Topic 04

PyTorch: The Training Loop

PyTorch is Python-first, dynamic graph, and feels like NumPy with GPU support and automatic differentiation. The training loop has exactly 4 steps, always in this order:

import torch
import torch.nn as nn

# Define model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# train_loader is assumed to be a DataLoader yielding (inputs, labels) batches
for epoch in range(10):
    for X_batch, y_batch in train_loader:
        # Step 1: Forward pass
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)

        # Step 2: Zero gradients (CRITICAL: don't forget!)
        optimizer.zero_grad()

        # Step 3: Backward pass
        loss.backward()

        # Step 4: Update weights
        optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
Most common bug

Forgetting optimizer.zero_grad(). PyTorch accumulates gradients by default. Without zeroing, gradients from previous batches add up and your model diverges.
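A tiny PyTorch demonstration of the accumulation behavior, using a made-up one-parameter "model" for illustration:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Backpropagate the same loss twice WITHOUT zeroing in between
for _ in range(2):
    loss = (3.0 * w) ** 2 / 2   # dloss/dw = 9w, so 9 at w = 1
    loss.backward()
print(w.grad)  # tensor(18.): two backward passes added up

w.grad.zero_()  # what optimizer.zero_grad() does for every parameter
loss = (3.0 * w) ** 2 / 2
loss.backward()
print(w.grad)  # tensor(9.): the correct gradient
```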

Topic 05

Convolutional Neural Networks (CNNs)

CNNs are specialized for grid-structured data (images, audio spectrograms). Instead of connecting every neuron to every input pixel (expensive and wasteful), CNNs use local receptive fields — small filters that slide across the image.

  • Convolution: A 3×3 filter slides across the image, computing dot products at each position → produces a feature map
  • Filters learn features: Early layers learn edges, later layers learn textures, deepest layers learn objects
  • Pooling: Max-pool reduces spatial dimensions while keeping the strongest activations
  • Translation equivariance: A cat in the top-left and bottom-right activates the same "cat detector" filter, because convolution shifts along with the input; pooling then adds a degree of true translation invariance
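The sliding-filter idea fits in a few lines of NumPy. Here a classic Sobel vertical-edge kernel is slid over a toy image (values made up for illustration): the feature map responds only where the dark-to-bright edge is.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a k x k filter over the image (stride 1, no padding)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the filter with one local patch
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Toy image: dark left half, bright right half (a vertical edge)
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
fmap = conv2d(image, sobel_x)
print(fmap[0])  # [0. 4. 4. 0.]: strongest response where the edge is
```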

Topic 06

Transfer Learning: Don't Reinvent the Wheel

Training a CNN from scratch on ImageNet takes weeks on 8 GPUs. Transfer learning says: use a model already trained on millions of images, then adapt it to your task.

  1. Load a pretrained model (ResNet50, EfficientNet, ViT) — trained on ImageNet's 1.2M images
  2. Freeze the early layers (they detect universal features: edges, textures)
  3. Replace the final classification head with your own (correct number of output classes)
  4. Fine-tune on your dataset (even with 1000 images, you can get excellent results)
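The freeze-and-replace pattern in those steps can be sketched as below, using a small stand-in network so the example is self-contained (in real use you would load, e.g., torchvision's resnet50 with pretrained ImageNet weights; the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained network: a backbone plus a 1000-class head
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),   # "backbone": universal features
    nn.Linear(128, 1000),             # original classification head
)

# Step 2: freeze everything that was pretrained
for param in model.parameters():
    param.requires_grad = False

# Step 3: swap in a fresh head for your own task (say, 5 classes);
# newly created modules have requires_grad=True by default
model[4] = nn.Linear(128, 5)

# Step 4: fine-tune; only the new head's parameters reach the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable))  # 645 (= 128*5 weights + 5 biases)
```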
Real-world impact

A medical imaging startup trained a skin cancer classifier with 500 labeled images. Using transfer learning from a ResNet pretrained on ImageNet, they achieved 94% accuracy. Training from scratch with 500 images would give ~60%. Transfer learning is almost always the right starting point.

Topic 07

RNNs & LSTMs: Memory Across Time

For sequential data (text, time series, audio), we need models with memory. An RNN processes one element at a time, maintaining a hidden state that carries information from previous steps.

h_t = tanh(W_h × h_{t-1} + W_x × x_t + b)

The problem: vanishing gradients. Information from 50 steps ago barely influences the gradient — the RNN effectively has short-term memory. LSTMs solve this with 3 gating mechanisms:

  • Forget gate: "What from my memory should I forget?" (sigmoid → 0 = forget, 1 = keep)
  • Input gate: "What new information should I store?"
  • Output gate: "What of my memory should I output now?"
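One LSTM time step with the three gates spelled out, sketched in NumPy (random toy weights, biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, params):
    """One LSTM step: the gates decide what to forget, store, and emit."""
    Wf, Wi, Wo, Wg = params          # one weight matrix per gate
    hx = np.concatenate([h, x])      # gates see previous state + new input
    f = sigmoid(Wf @ hx)             # forget gate: 0 = erase, 1 = keep
    i = sigmoid(Wi @ hx)             # input gate: how much new info to store
    o = sigmoid(Wo @ hx)             # output gate: how much memory to expose
    g = np.tanh(Wg @ hx)             # candidate memory content
    c = f * c + i * g                # cell state: the "conveyor belt"
    h = o * np.tanh(c)               # hidden state passed to the next step
    return h, c

# Run a toy 4-unit LSTM over 50 random 3-dimensional inputs
rng = np.random.default_rng(0)
H, X = 4, 3
params = [0.1 * rng.normal(size=(H, H + X)) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
for _ in range(50):
    h, c = lstm_step(rng.normal(size=X), h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

In PyTorch, `nn.LSTM` implements exactly this recurrence (with biases and batching) behind one call.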
Key Insight

LSTMs have a "cell state" — like a conveyor belt that can carry information through hundreds of steps with minimal modification. This is why they outperform vanilla RNNs on long sequences. Today, Transformers have largely replaced LSTMs for NLP — but LSTMs still dominate in time series forecasting.

Topic 08

Attention Mechanism: The Foundation of Transformers

RNNs read text left-to-right, maintaining a single hidden state. By the time you reach word 100, information about word 1 is diluted. Attention fixes this by letting every word directly look at every other word.

Think of it as a lookup table: for each word (Query), we look through all other words (Keys) and retrieve information from them (Values), weighted by relevance.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
  • Q (Query): "What am I looking for?"
  • K (Key): "What do I contain?"
  • V (Value): "What information do I provide?"
  • √d_k scaling: Prevents dot products from getting too large for softmax
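The formula above translates almost line for line into NumPy. This sketch uses random toy embeddings and feeds the same tensor as Q, K, and V (self-attention); in a real transformer each comes from a learned projection of the token embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every query against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Self-attention over 3 tokens with d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, weights = attention(X, X, X)
print(out.shape)             # (3, 4): one context vector per token
print(weights.sum(axis=-1))  # [1. 1. 1.]
```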
Intuition

"The bank by the river was muddy." — To understand "bank", the attention mechanism learns to attend strongly to "river" and weakly to other words. This context disambiguates the meaning. RNNs struggle with this; attention handles it naturally.

Your learning path
  • Watch Andrej Karpathy's "Neural Networks: Zero to Hero" — implement everything from scratch
  • Build the 4-step PyTorch training loop from memory on 3 different datasets
  • Fine-tune a pretrained ResNet on a custom image dataset (use Hugging Face or torchvision)
  • Implement a simple transformer from scratch following Karpathy's "makemore" series