Deep learning is just neural networks with many layers. Each layer learns increasingly abstract representations — early layers detect edges, middle layers detect shapes, final layers detect concepts. The key breakthroughs: more data, more compute, and the specific architecture choices that made training stable.
Topic 01
Neurons & The Forward Pass
Biological inspiration (and why it doesn't matter too much)
A biological neuron receives signals from many other neurons, sums them up, and fires if the total exceeds a threshold. An artificial neuron does the same: multiply each input by a weight, add a bias, and apply an activation function.
A single artificial neuron: weighted sum → activation → output
Stack many neurons in parallel → a layer. Stack many layers → a deep neural network. The forward pass is simply passing input data through every layer, left to right, until you get a prediction.
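The neuron described above can be sketched in a few lines of plain Python (a toy scalar version; real layers use vectorized tensor math):

```python
def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum -> activation -> output."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # weighted sum + bias
    return max(0.0, z)  # ReLU activation

def layer(inputs, weight_rows, biases):
    """A layer is just many neurons reading the same inputs in parallel."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]
```

A forward pass through a deep network is then nothing more than feeding one layer's outputs into the next as inputs.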
Topic 02
Backpropagation: Blame Assignment
After the forward pass, we compute the loss (how wrong the prediction was). Backpropagation figures out how much each weight contributed to that error — so we can adjust them.
Think of it as a factory that produced a defective product. The factory manager traces back: was the defect in the raw material, the assembly step, or the quality check? Each step gets a "blame score." In neural networks, this blame score is the gradient — ∂Loss/∂Weight.
Backprop is just the chain rule applied recursively from output to input. Modern frameworks (PyTorch, TensorFlow) compute this automatically via autograd — you never implement it manually. But understanding it helps you debug gradient explosions/vanishing gradients.
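A minimal autograd sketch, assuming PyTorch is installed: one weight, one input, a squared-error loss, with the gradient checked by hand against the chain rule:

```python
import torch

# One weight, one input, squared-error loss.
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
y_true = torch.tensor(10.0)

y_pred = w * x                  # forward pass: y_pred = 6
loss = (y_pred - y_true) ** 2   # loss = (6 - 10)^2 = 16

loss.backward()                 # backprop: chain rule, output to input
# By hand: dLoss/dw = 2 * (w*x - y_true) * x = 2 * (-4) * 3 = -24
print(w.grad)  # tensor(-24.)
```

The call to `loss.backward()` is exactly the "blame assignment" step: autograd walks the computation graph backward and deposits a gradient on every tensor with `requires_grad=True`.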
Topic 03
Activation Functions & Why Non-Linearity Matters
Without activation functions, stacking layers does nothing — a sequence of linear transformations is still linear. Non-linear activations let neural networks learn complex, curved decision boundaries.
| Activation | Formula | Use case | Pitfall |
|---|---|---|---|
| ReLU | max(0, x) | Hidden layers (default choice) | Dying ReLU: neurons stuck at 0 |
| Leaky ReLU | max(0.01x, x) | Hidden layers when dying ReLU is a problem | Slight computational overhead |
| Sigmoid | 1/(1+e⁻ˣ) | Binary output layer | Vanishing gradients for deep nets |
| Tanh | (e^x-e^-x)/(e^x+e^-x) | RNNs, when zero-centering matters | Also saturates at extremes |
| Softmax | e^xᵢ/Σe^xⱼ | Multi-class output layer | Not for hidden layers |
| GELU | x·Φ(x) | Transformers (BERT, GPT) | Slightly more compute than ReLU |
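For reference, the table's formulas as plain Python (toy scalar versions; frameworks ship vectorized, numerically hardened implementations):

```python
import math

def relu(x):       return max(0.0, x)
def leaky_relu(x): return max(0.01 * x, x)
def sigmoid(x):    return 1.0 / (1.0 + math.exp(-x))
def tanh(x):       return math.tanh(x)
def gelu(x):       return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # x * Phi(x)

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]
```

Note the max-subtraction trick inside `softmax`: it changes nothing mathematically but prevents `exp` from overflowing on large inputs.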
Topic 04
PyTorch: The Training Loop
PyTorch is Python-first and dynamic-graph: it feels like NumPy with GPU support and automatic differentiation. The training loop has exactly 4 steps, always in this order: zero the gradients, forward pass (and compute the loss), backward pass, optimizer step.
The classic pitfall is forgetting optimizer.zero_grad(). PyTorch accumulates gradients by default, so without zeroing, gradients from previous batches add up and your model diverges.
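A minimal sketch of the four-step loop on a toy regression problem (the layer sizes and hyperparameters here are illustrative, not from the text):

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(3, 1)                          # a one-layer "network"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

X = torch.randn(64, 3)
y = X.sum(dim=1, keepdim=True)                   # toy target: sum of the features

for epoch in range(100):
    optimizer.zero_grad()        # 1. clear old gradients (they accumulate!)
    pred = model(X)              # 2. forward pass...
    loss = loss_fn(pred, y)      #    ...and compute the loss
    loss.backward()              # 3. backward pass: compute gradients
    optimizer.step()             # 4. update the weights
```

The same skeleton scales unchanged from this toy model to a billion-parameter network; only the model, data loader, and optimizer choice change.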
Topic 05
Convolutional Neural Networks (CNNs)
CNNs are specialized for grid-structured data (images, audio spectrograms). Instead of connecting every neuron to every input pixel (expensive and wasteful), CNNs use local receptive fields — small filters that slide across the image.
- Convolution: A 3×3 filter slides across the image, computing dot products at each position → produces a feature map
- Filters learn features: Early layers learn edges, later layers learn textures, deepest layers learn objects
- Pooling: Max-pool reduces spatial dimensions while keeping the strongest activations
- Translation equivariance: because the same filter weights slide over every position, a cat in the top-left and a cat in the bottom-right activate the same "cat detector" filter (pooling then adds a degree of translation invariance)
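A toy CNN along these lines, built with torch.nn (the filter counts and image size are illustrative assumptions):

```python
import torch
from torch import nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3x3 filters slide over the image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve spatial dims, keep strongest activations
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # deeper layers: more, richer filters
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # classifier head for 10 classes
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(tiny_cnn(x).shape)        # torch.Size([1, 10])
```

Each max-pool halves the spatial size (32 → 16 → 8), which is why the final linear layer expects 32 × 8 × 8 inputs.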
Topic 06
Transfer Learning: Don't Reinvent the Wheel
Training a CNN from scratch on ImageNet takes weeks on 8 GPUs. Transfer learning says: use a model already trained on millions of images, then adapt it to your task.
- Load a pretrained model (ResNet50, EfficientNet, ViT) — trained on ImageNet's 1.2M images
- Freeze the early layers (they detect universal features: edges, textures)
- Replace the final classification head with your own (correct number of output classes)
- Fine-tune on your dataset (even with 1000 images, you can get excellent results)
A medical imaging startup trained a skin cancer classifier with 500 labeled images. Using transfer learning from a ResNet pretrained on ImageNet, they achieved 94% accuracy. Training from scratch with 500 images would give ~60%. Transfer learning is almost always the right starting point.
Topic 07
RNNs & LSTMs: Memory Across Time
For sequential data (text, time series, audio), we need models with memory. An RNN processes one element at a time, maintaining a hidden state that carries information from previous steps.
The problem: vanishing gradients. Information from 50 steps ago barely influences the gradient, so the RNN effectively has only short-term memory. LSTMs mitigate this with 3 gating mechanisms:
- Forget gate: "What from my memory should I forget?" (sigmoid → 0 = forget, 1 = keep)
- Input gate: "What new information should I store?"
- Output gate: "What of my memory should I output now?"
LSTMs have a "cell state" — like a conveyor belt that can carry information through hundreds of steps with minimal modification. This is why they outperform vanilla RNNs on long sequences. Today, Transformers have largely replaced LSTMs for NLP, but LSTMs remain a strong choice for time series forecasting.
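A minimal nn.LSTM sketch (the sizes are illustrative); note the cell state c_n returned alongside the hidden state — that is the "conveyor belt":

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 50, 8)        # batch of 4 sequences, 50 steps, 8 features per step
outputs, (h_n, c_n) = lstm(x)    # c_n is the cell state, the "conveyor belt"

print(outputs.shape)  # torch.Size([4, 50, 16])  hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 16])   final hidden state per sequence
```

`outputs` exposes the hidden state at every step (useful for sequence labeling), while `h_n`/`c_n` summarize each sequence (useful for classification or forecasting).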
Topic 08
Attention Mechanism: The Foundation of Transformers
RNNs read text left-to-right, maintaining a single hidden state. By the time you reach word 100, information about word 1 is diluted. Attention fixes this by letting every word directly look at every other word.
Think of it as a lookup table: for each word (Query), we look through all other words (Keys) and retrieve information from them (Values), weighted by relevance.
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What information do I provide?"
- √d_k scaling: Prevents dot products from getting too large for softmax
"The bank by the river was muddy." — To understand "bank", the attention mechanism learns to attend strongly to "river" and weakly to other words. This context disambiguates the meaning. RNNs struggle with this; attention handles it naturally.
- Watch Andrej Karpathy's "Neural Networks: Zero to Hero" — implement everything from scratch
- Build the 4-step PyTorch training loop from memory on 3 different datasets
- Fine-tune a pretrained ResNet on a custom image dataset (use Hugging Face or torchvision)
- Implement a simple transformer from scratch following Karpathy's "makemore" series