Deep learning is just neural networks with many layers. Each layer learns increasingly abstract representations — early layers detect edges, middle layers detect shapes, final layers detect concepts. The key breakthroughs: more data, more compute, and the specific architecture choices that made training stable.

Topic 01

Neurons & The Forward Pass


Biological inspiration (and why it doesn't matter too much)

A biological neuron receives signals from many other neurons, sums them up, and fires if the total exceeds a threshold. An artificial neuron does the same: multiply each input by a weight, add a bias, and apply an activation function.

output = activation(Σ wᵢxᵢ + b) = activation(W·x + b)
[Figure: a single artificial neuron. Inputs x₁, x₂, x₃ are weighted by w₁, w₂, w₃, summed with the bias b, and passed through a ReLU activation to produce ŷ.]

A single artificial neuron: weighted sum → activation → output

Stack many neurons in parallel → a layer. Stack many layers → a deep neural network. The forward pass is simply passing input data through every layer, left to right, until you get a prediction.
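A minimal NumPy sketch of this forward pass; the weights and inputs below are hand-picked, purely illustrative numbers:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """The forward pass: apply each layer's W·x + b, then the activation."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# A tiny 3 -> 2 -> 1 network, each layer a (W, b) pair
layers = [
    (np.array([[1.0, -1.0, 0.5],
               [0.0,  2.0, 1.0]]), np.array([0.1, -0.2])),
    (np.array([[1.0, 1.0]]), np.array([0.0])),
]
x = np.array([1.0, 0.5, 2.0])
print(forward(x, layers))  # [4.4]
```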

Topic 02

Backpropagation: Blame Assignment

After the forward pass, we compute the loss (how wrong the prediction was). Backpropagation figures out how much each weight contributed to that error — so we can adjust them.

Think of it as a factory that produced a defective product. The factory manager traces back: was the defect in the raw material, the assembly step, or the quality check? Each step gets a "blame score." In neural networks, this blame score is the gradient — dLoss/dWeight.

dL/dw₁ = dL/dŷ × dŷ/dz × dz/dw₁    (chain rule)
Key Insight

Backprop is just the chain rule applied recursively from output to input. Modern frameworks (PyTorch, TensorFlow) compute this automatically via autograd — you never implement it manually. But understanding it helps you debug gradient explosions/vanishing gradients.
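To make the chain rule concrete, here is the gradient for a one-weight "network" (identity activation, squared-error loss; all numbers are made up), checked against a finite-difference estimate:

```python
# z = w*x + b,  yhat = z (identity activation),  L = (yhat - y)^2
w, b, x, y = 0.5, 0.1, 2.0, 1.0

z = w * x + b
yhat = z
L = (yhat - y) ** 2

# Chain rule, term by term, exactly as in the formula above
dL_dyhat = 2 * (yhat - y)   # derivative of the squared error
dyhat_dz = 1.0              # identity activation
dz_dw = x
grad_analytic = dL_dyhat * dyhat_dz * dz_dw
print(round(grad_analytic, 10))  # 0.4

# Sanity check with a finite difference: nudge w, see how L moves
eps = 1e-6
L_nudged = ((w + eps) * x + b - y) ** 2
grad_numeric = (L_nudged - L) / eps
print(abs(grad_numeric - grad_analytic) < 1e-4)  # True
```

This nudge-and-compare check is exactly how autograd implementations are tested (`torch.autograd.gradcheck` automates it).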

Topic 03

Activation Functions & Why Non-Linearity Matters

Without activation functions, stacking layers does nothing — a sequence of linear transformations is still linear. Non-linear activations let neural networks learn complex, curved decision boundaries.

  • ReLU: max(0, x). Use: hidden layers (the default choice). Pitfall: dying ReLU (neurons stuck at 0).
  • Leaky ReLU: max(0.01x, x). Use: hidden layers when dying ReLU is a problem. Pitfall: slight computational overhead.
  • Sigmoid: 1/(1 + e⁻ˣ). Use: binary output layer. Pitfall: vanishing gradients in deep nets.
  • Tanh: (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ). Use: RNNs, when zero-centered outputs matter. Pitfall: also saturates at extremes.
  • Softmax: exp(xᵢ)/Σⱼ exp(xⱼ). Use: multi-class output layer. Pitfall: not for hidden layers.
  • GELU: x·Φ(x). Use: Transformers (BERT, GPT). Pitfall: slightly more compute than ReLU.
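A few of these written out in NumPy, including a direct look at the sigmoid pitfall: its gradient σ(x)(1 − σ(x)) peaks at 0.25 and collapses toward zero once inputs saturate.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Sigmoid gradient: at most 0.25, nearly zero when |x| is large
g_center = sigmoid(0.0) * (1 - sigmoid(0.0))
g_saturated = sigmoid(10.0) * (1 - sigmoid(10.0))
print(g_center)     # 0.25
print(g_saturated)  # ~4.5e-05: the vanishing-gradient pitfall

# Softmax turns arbitrary scores into a valid probability distribution
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0
```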

Topic 04

PyTorch: The Training Loop

PyTorch is Python-first, dynamic graph, and feels like NumPy with GPU support and automatic differentiation. The training loop has exactly 4 steps, always in this order:

import torch
import torch.nn as nn

# Define model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# train_loader is assumed to be a DataLoader yielding (inputs, labels) batches
for epoch in range(10):
    for X_batch, y_batch in train_loader:
        # Step 1: Forward pass
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)

        # Step 2: Zero gradients (CRITICAL: don't forget!)
        optimizer.zero_grad()

        # Step 3: Backward pass
        loss.backward()

        # Step 4: Update weights
        optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
Most common bug

Forgetting optimizer.zero_grad(). PyTorch accumulates gradients by default. Without zeroing, gradients from previous batches add up and your model diverges.
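A tiny PyTorch demonstration of the accumulation behavior, using a made-up one-parameter "model" for illustration:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Backpropagate the same loss twice WITHOUT zeroing in between
for _ in range(2):
    loss = (3.0 * w) ** 2 / 2   # dloss/dw = 9w, so 9 at w = 1
    loss.backward()
print(w.grad)  # tensor(18.): two backward passes added up

w.grad.zero_()  # what optimizer.zero_grad() does for every parameter
loss = (3.0 * w) ** 2 / 2
loss.backward()
print(w.grad)  # tensor(9.): the correct gradient
```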

Topic 05

Convolutional Neural Networks (CNNs)

CNNs are specialized for grid-structured data (images, audio spectrograms). Instead of connecting every neuron to every input pixel (expensive and wasteful), CNNs use local receptive fields — small filters that slide across the image.

  • Convolution: A 3×3 filter slides across the image, computing dot products at each position → produces a feature map
  • Filters learn features: Early layers learn edges, later layers learn textures, deepest layers learn objects
  • Pooling: Max-pool reduces spatial dimensions while keeping the strongest activations
  • Translation equivariance: A cat in the top-left and bottom-right activates the same "cat detector" filter, because convolution shifts along with the input; pooling then adds a degree of true translation invariance
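The sliding-filter idea fits in a few lines of NumPy. Here a classic Sobel vertical-edge kernel is slid over a toy image (values made up for illustration): the feature map responds only where the dark-to-bright edge is.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a k x k filter over the image (stride 1, no padding)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the filter with one local patch
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Toy image: dark left half, bright right half (a vertical edge)
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
fmap = conv2d(image, sobel_x)
print(fmap[0])  # [0. 4. 4. 0.]: strongest response where the edge is
```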

Topic 06

Transfer Learning: Don't Reinvent the Wheel

Training a CNN from scratch on ImageNet takes weeks on 8 GPUs. Transfer learning says: use a model already trained on millions of images, then adapt it to your task.

  1. Load a pretrained model (ResNet50, EfficientNet, ViT) — trained on ImageNet's 1.2M images
  2. Freeze the early layers (they detect universal features: edges, textures)
  3. Replace the final classification head with your own (correct number of output classes)
  4. Fine-tune on your dataset (even with 1000 images, you can get excellent results)
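The freeze-and-replace pattern in those steps can be sketched as below, using a small stand-in network so the example is self-contained (in real use you would load, e.g., torchvision's resnet50 with pretrained ImageNet weights; the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained network: a backbone plus a 1000-class head
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),   # "backbone": universal features
    nn.Linear(128, 1000),             # original classification head
)

# Step 2: freeze everything that was pretrained
for param in model.parameters():
    param.requires_grad = False

# Step 3: swap in a fresh head for your own task (say, 5 classes);
# newly created modules have requires_grad=True by default
model[4] = nn.Linear(128, 5)

# Step 4: fine-tune; only the new head's parameters reach the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable))  # 645 (= 128*5 weights + 5 biases)
```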
Real-world impact

A medical imaging startup trained a skin cancer classifier with 500 labeled images. Using transfer learning from a ResNet pretrained on ImageNet, they achieved 94% accuracy. Training from scratch with 500 images would give ~60%. Transfer learning is almost always the right starting point.

Topic 07

RNNs & LSTMs: Memory Across Time

For sequential data (text, time series, audio), we need models with memory. An RNN processes one element at a time, maintaining a hidden state that carries information from previous steps.

h_t = tanh(W_h × h_{t-1} + W_x × x_t + b)

The problem: vanishing gradients. Information from 50 steps ago barely influences the gradient — the RNN effectively has short-term memory. LSTMs solve this with 3 gating mechanisms:

  • Forget gate: "What from my memory should I forget?" (sigmoid → 0 = forget, 1 = keep)
  • Input gate: "What new information should I store?"
  • Output gate: "What of my memory should I output now?"
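One LSTM time step with the three gates spelled out, sketched in NumPy (random toy weights, biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, params):
    """One LSTM step: the gates decide what to forget, store, and emit."""
    Wf, Wi, Wo, Wg = params          # one weight matrix per gate
    hx = np.concatenate([h, x])      # gates see previous state + new input
    f = sigmoid(Wf @ hx)             # forget gate: 0 = erase, 1 = keep
    i = sigmoid(Wi @ hx)             # input gate: how much new info to store
    o = sigmoid(Wo @ hx)             # output gate: how much memory to expose
    g = np.tanh(Wg @ hx)             # candidate memory content
    c = f * c + i * g                # cell state: the "conveyor belt"
    h = o * np.tanh(c)               # hidden state passed to the next step
    return h, c

# Run a toy 4-unit LSTM over 50 random 3-dimensional inputs
rng = np.random.default_rng(0)
H, X = 4, 3
params = [0.1 * rng.normal(size=(H, H + X)) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
for _ in range(50):
    h, c = lstm_step(rng.normal(size=X), h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

In PyTorch, `nn.LSTM` implements exactly this recurrence (with biases and batching) behind one call.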
Key Insight

LSTMs have a "cell state" — like a conveyor belt that can carry information through hundreds of steps with minimal modification. This is why they outperform vanilla RNNs on long sequences. Today, Transformers have largely replaced LSTMs for NLP — but LSTMs still dominate in time series forecasting.

Topic 08

Attention Mechanism: The Foundation of Transformers

RNNs read text left-to-right, maintaining a single hidden state. By the time you reach word 100, information about word 1 is diluted. Attention fixes this by letting every word directly look at every other word.

Think of it as a lookup table: for each word (Query), we look through all other words (Keys) and retrieve information from them (Values), weighted by relevance.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
  • Q (Query): "What am I looking for?"
  • K (Key): "What do I contain?"
  • V (Value): "What information do I provide?"
  • √d_k scaling: Prevents dot products from getting too large for softmax
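The formula above translates almost line for line into NumPy. This sketch uses random toy embeddings and feeds the same tensor as Q, K, and V (self-attention); in a real transformer each comes from a learned projection of the token embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every query against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Self-attention over 3 tokens with d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, weights = attention(X, X, X)
print(out.shape)             # (3, 4): one context vector per token
print(weights.sum(axis=-1))  # [1. 1. 1.]
```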
Intuition

"The bank by the river was muddy." — To understand "bank", the attention mechanism learns to attend strongly to "river" and weakly to other words. This context disambiguates the meaning. RNNs struggle with this; attention handles it naturally.

Your learning path
  • Watch Andrej Karpathy's "Neural Networks: Zero to Hero" — implement everything from scratch
  • Build the 4-step PyTorch training loop from memory on 3 different datasets
  • Fine-tune a pretrained ResNet on a custom image dataset (use Hugging Face or torchvision)
  • Implement a simple transformer from scratch following Karpathy's "makemore" series