Backpropagation — the algorithm that learns
Every neural network learns by comparing its predictions to the truth, then adjusting weights to reduce error. Backpropagation is the algorithm that shows you how much each weight mattered — and in which direction to change it. Once you understand the chain rule, you understand the most important algorithm in deep learning.
Why backpropagation matters
You have a neural network with millions of weights. You feed it input, it makes a prediction, and you measure the error (loss). Now what? You can't just try random changes — the space of possibilities is infinite.
Backpropagation solves this. It computes, for every weight, exactly how much that weight contributed to the error, and in which direction. This gradient tells you: increase this weight or decrease it? By how much? It transforms an impossible optimization problem into a solvable one.
The learning problem
Given weights W, input X, target Y, and loss function L:
Find: dL/dW (gradient of loss with respect to each weight)
Update: W ← W − α·(dL/dW), where α is the learning rate
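To make the update rule concrete, here is a minimal sketch for a model with a single weight. The model (y_hat = w * x), the squared-error loss, and the numbers are assumptions chosen for illustration; they do not come from the lesson.

```python
# Hypothetical one-weight model: y_hat = w * x, loss = (y_hat - y)^2.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    # dL/dw = 2 * (w*x - y) * x, by the chain rule
    return 2.0 * (w * x - y) * x

def sgd_step(w, x, y, alpha):
    # W <- W - alpha * dL/dW
    return w - alpha * grad(w, x, y)

w, x, y, alpha = 0.0, 2.0, 4.0, 0.1   # illustrative values
losses = []
for _ in range(20):
    losses.append(loss(w, x, y))
    w = sgd_step(w, x, y, alpha)

print(w, losses[0], losses[-1])
```

With these numbers the best weight is w = 2, since 2 * 2.0 = 4.0, and repeated updates move w toward it while the loss shrinks.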
Before the mid-1980s, multi-layer neural networks looked like a dead end: researchers could build them but had no practical way to train them. David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-Propagating Errors" in 1986, reinvigorating the field.
Backprop wasn't entirely new — it appeared in earlier work by Linnainmaa (1970) and others — but Rumelhart, Hinton, and Williams showed how powerful it was when applied to multi-layer networks. It became the foundation of deep learning.
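Today, virtually every neural network is trained with backprop. Optimizers such as SGD, Adam, and RMSprop are not variants of backprop; they decide how to turn the gradients that backprop computes into weight updates. Without backprop, deep learning as we know it wouldn't exist.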
Quick check
Why can't we just try random weight changes to reduce loss?
The chain rule — calculus you already know
Backpropagation is just the chain rule applied to neural networks. You learned this in calculus: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). In a neural network, you have long chains: output = f(layer3(layer2(layer1(input)))). To find dLoss/dWeight, you multiply all these derivatives together.
Interactive chain rule explorer (widget): composition y = f(g(x)), with u = g(x).
Values shown: x = 2.00, u = 6.00, y = 36.00.
Chain rule: dy/dx = (dy/du) · (du/dx), with du/dx = 3.000 and dy/du = 12.000, so dy/dx = 12.000 × 3.000 = 36.000.
The chain rule tells us: to find the slope at x, multiply the slopes of the intermediate steps. Each inner arrow shows one derivative, and we multiply them together to get dy/dx.
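You can check the chain rule numerically. The sketch below assumes g(x) = 3x and f(u) = u², which is one pair of functions consistent with the explorer's values (x = 2, u = 6, y = 36); the page does not state which functions it actually uses.

```python
def g(x):
    return 3.0 * x          # assumed inner function: u = 3x

def f(u):
    return u ** 2           # assumed outer function: y = u^2

def dg_dx(x):
    return 3.0              # derivative of g

def df_du(u):
    return 2.0 * u          # derivative of f

x = 2.0
u = g(x)
y = f(u)

# Chain rule: multiply the slopes of the intermediate steps.
dy_dx = df_du(u) * dg_dx(x)

# Independent check: central finite difference on the composed function.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(u, y, dy_dx, numeric)
```

The analytic product and the finite-difference estimate should agree to several decimal places.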
In a neural network, every layer takes multiple inputs. The chain rule still applies; you just use partial derivatives. If z = f(x, y) and y itself depends on x, then:
dz/dx = ∂f/∂x + (∂f/∂y)(dy/dx)
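A weight can affect the loss through multiple paths: the neuron it feeds sends its output to every neuron in the next layer, and each of those is a separate route to the loss. Backprop sums the contributions from all these paths, and it does so efficiently by reusing each intermediate derivative instead of recomputing it.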
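Here is a small sketch of summing contributions over two paths. The functions f(x, y) = x * y and y = x² are hypothetical examples chosen so that the answer is easy to verify by hand (z = x³, so dz/dx = 3x²).

```python
def y_of(x):
    return x ** 2           # y depends on x (hypothetical choice)

def f(x, y):
    return x * y            # z = f(x, y) (hypothetical choice)

x = 2.0
y = y_of(x)

df_dx = y                   # partial of f w.r.t. x, holding y fixed
df_dy = x                   # partial of f w.r.t. y, holding x fixed
dy_dx = 2.0 * x             # derivative of y with respect to x

# Direct path plus the path that goes through y.
dz_dx = df_dx + df_dy * dy_dx

# Finite-difference check on z(x) = f(x, y(x)).
h = 1e-6
numeric = (f(x + h, y_of(x + h)) - f(x - h, y_of(x - h))) / (2 * h)

print(dz_dx, numeric)
```

Dropping either term gives the wrong slope, which is why backprop has to add up every path from a value to the loss.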
Quick check
If you have dy/dz = 2 and dz/dx = 3, what is dy/dx?
Forward pass — inputs to loss
Before you can compute gradients, you need to run the network forward. Each layer (1) multiplies its inputs by weights, (2) adds biases, and (3) applies an activation function (sigmoid, ReLU, etc.). At the end, you compare the output to the target and compute the loss.
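The three steps per layer can be sketched in plain Python. The helper name `dense_forward`, the weights, the target, the linear output layer, and the half-squared-error loss are all assumptions made for illustration; only the inputs 0.5 and 0.3 match the values used later in the lesson.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, weights, biases, activation):
    """One layer. weights[j][i] connects input i to neuron j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b  # steps 1 and 2
        outputs.append(activation(z))                      # step 3
    return outputs

x = [0.5, 0.3]
W1 = [[0.1, -0.2], [0.4, 0.3]]   # hypothetical hidden-layer weights
b1 = [0.0, 0.1]
W2 = [[0.5, -0.6]]               # hypothetical output-layer weights
b2 = [0.05]
target = 0.7                     # hypothetical target

hidden = dense_forward(x, W1, b1, sigmoid)
prediction = dense_forward(hidden, W2, b2, lambda z: z)[0]  # linear output
loss = 0.5 * (prediction - target) ** 2

print(hidden, prediction, loss)
```

The intermediate values in `hidden` are worth keeping around: the backward pass needs them to compute the sigmoid derivatives.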
Forward pass anatomy
Sigmoid squashes outputs to (0, 1). It's interpretable and smooth, but its derivative is always ≤ 0.25, which causes vanishing gradients in deep networks.
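ReLU (max(0, x)) is simple, and its derivative is exactly 1 for positive inputs, so gradients pass through active units unshrunk. But its derivative is 0 for negative inputs, so a neuron whose input is negative for every training example receives no gradient and becomes a "dead" ReLU that may never recover.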
Modern networks often use variants: Leaky ReLU (allows small negative slopes), GELU, or SiLU. The choice affects both forward pass speed and backward pass stability.
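A short sketch of these activations and their derivatives makes the differences easier to see. The Leaky ReLU slope of 0.01 is a common default used here as an assumption, not a value from the lesson.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # peaks at z = 0

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # zero gradient for negative inputs

def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z > 0 else slope  # small but nonzero for negative inputs

# Scan a range of inputs to find the largest sigmoid derivative.
zs = [i / 10.0 for i in range(-50, 51)]
max_sigmoid_grad = max(sigmoid_grad(z) for z in zs)

print(max_sigmoid_grad, relu_grad(-1.0), leaky_relu_grad(-1.0))
```

The scan confirms that the sigmoid derivative never exceeds 0.25, while ReLU passes a gradient of 1 for positive inputs and Leaky ReLU keeps a small gradient alive for negative ones.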
Quick check
In the forward pass, what does the sigmoid activation do?
Backward pass — gradients flow home
This is the heart of backpropagation. Starting at the loss, we compute how much each weight contributed to the error. Step through the guided lesson below: watch the gradients propagate backward, see how each weight gets blamed or praised, and update weights to reduce loss.
A tiny 2-layer network
We start with a small neural network: 2 inputs, 2 hidden neurons with sigmoid activation, and 1 output. The weights are initialized randomly. Our job is to predict a target value from the input. The network is way off right now, so it needs to learn.
Widget state: Input 1 = 0.50, Input 2 = 0.30, Prediction = 0.5025, Loss = 0.0195. Network values are shown for the input, layer 1, and layer 2.
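The backward pass for a network of this shape can be sketched as follows. This is not the page's own implementation: the weights, the target of 0.7, the linear output neuron, and the half-squared-error loss are assumptions, so the numbers will not match the widget. The sketch also checks one backprop gradient against a finite-difference estimate and takes a single gradient-descent step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: weighted sum plus bias, then sigmoid.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output neuron: linear (an assumption of this sketch).
    pred = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, pred

def loss_fn(pred, target):
    return 0.5 * (pred - target) ** 2

def backward(x, h, pred, target, W2):
    d_pred = pred - target                        # dL/dpred
    dW2 = [d_pred * hj for hj in h]               # dL/dW2[j]
    db2 = d_pred
    d_h = [d_pred * w for w in W2]                # dL/dh[j]
    # Through the sigmoid: sigmoid'(z) = h * (1 - h)
    d_z = [dh * hj * (1.0 - hj) for dh, hj in zip(d_h, h)]
    dW1 = [[dz * xi for xi in x] for dz in d_z]   # dL/dW1[j][i]
    db1 = list(d_z)
    return dW1, db1, dW2, db2

x, target = [0.5, 0.3], 0.7                       # target is hypothetical
W1 = [[0.1, -0.2], [0.4, 0.3]]                    # hypothetical weights
b1 = [0.0, 0.1]
W2 = [0.5, -0.6]
b2 = 0.05

h, pred = forward(x, W1, b1, W2, b2)
old_loss = loss_fn(pred, target)
dW1, db1, dW2, db2 = backward(x, h, pred, target, W2)

# Gradient check on one weight using a central finite difference.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss_fn(forward(x, W1_plus, b1, W2, b2)[1], target)
           - loss_fn(forward(x, W1_minus, b1, W2, b2)[1], target)) / (2 * eps)

# One gradient-descent step: weight <- weight - alpha * gradient.
alpha = 0.1
W1 = [[w - alpha * g for w, g in zip(w_row, g_row)]
      for w_row, g_row in zip(W1, dW1)]
b1 = [b - alpha * g for b, g in zip(b1, db1)]
W2 = [w - alpha * g for w, g in zip(W2, dW2)]
b2 = b2 - alpha * db2
new_loss = loss_fn(forward(x, W1, b1, W2, b2)[1], target)

print(dW1[0][1], numeric, old_loss, new_loss)
```

Comparing against a finite difference is a common way to catch mistakes in a hand-written backward pass. The two values should agree closely, and the loss after the step should be lower than before.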
Quick check
What does a large gradient on a weight mean?
The learning rate (α) controls step size. With the gradient pointing downhill, we move a little bit in that direction: NewWeight = OldWeight − α × Gradient.
Too high: we overshoot and miss the minimum (or diverge entirely). Too low: we crawl forward so slowly that training takes forever.
Typical values for SGD fall roughly between 0.001 and 0.01, though the best value depends on the model and data. Modern optimizers like Adam adapt the step size per weight, making training more robust to this choice.
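The effect of step size is easy to see on a toy problem. The sketch below minimizes the hypothetical loss L(w) = w², whose gradient is 2w, with three learning rates chosen for illustration.

```python
def run(alpha, steps=20, w=1.0):
    """Gradient descent on L(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

too_low = run(0.001)   # barely moves from the starting point
good = run(0.4)        # converges quickly toward the minimum at 0
too_high = run(1.1)    # each step overshoots further, so |w| grows

print(too_low, good, too_high)
```

On this loss each update multiplies w by (1 − 2α), so any α above 1 makes the iterates grow instead of shrink.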
What can go wrong: vanishing and exploding gradients
In very deep networks, gradients can shrink to zero (vanishing) or explode to infinity. Both break learning. Explore how network architecture and activation functions affect gradient flow.
Gradient magnitude flow (Sigmoid)
The vanishing gradient problem
Sigmoid derivatives are always ≤ 0.25. Multiplying many of these small values together causes gradients to shrink exponentially, making learning in deep networks very slow.
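A quick calculation shows how fast this shrinks. The sketch looks only at the activation factor, assuming the best case of 0.25 at every layer and ignoring the weights, so it gives an upper bound on that factor rather than the exact gradient of any real network.

```python
MAX_SIGMOID_GRAD = 0.25   # largest possible sigmoid derivative

# Upper bound on the product of sigmoid derivatives across `depth` layers.
bounds = {depth: MAX_SIGMOID_GRAD ** depth for depth in (5, 10, 20)}

for depth, bound in bounds.items():
    print(depth, bound)
```

Even in this best case, ten layers already scale the gradient by less than one millionth.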
Gradient magnitude flow (ReLU)
The exploding gradient problem
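ReLU does not shrink gradients the way sigmoid does: its derivative is 1 for active units. So without proper weight initialization or normalization, large weights can make gradients grow exponentially through the layers, leading to instability and NaN values.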
Vanishing Gradients
Each layer multiplies by a small number. After many layers, the product approaches zero. Early layers never learn.
Exploding Gradients
Each layer multiplies the gradient by a factor larger than 1. The product grows exponentially with depth, and the weights can overflow to NaN.
Dead Neurons
With ReLU, if a neuron's weighted sum is negative, its output is 0 and its gradient is 0. If that happens for every input, the neuron stops updating and may never recover.
Learning Rate
Too high and you overshoot. Too low and you move at a snail's pace. Adaptive optimizers help tune this automatically.
For vanishing gradients: Residual connections (skip connections) let gradients bypass layers. Batch normalization stabilizes activation ranges. LSTM and GRU gates explicitly control gradient flow.
For exploding gradients: Gradient clipping caps the norm. Weight regularization (L2) keeps weights small. Careful weight initialization (Xavier, He initialization) starts with a good scale.
For dead ReLU: Leaky ReLU allows small negative slopes. ELU and SELU also keep a nonzero gradient for negative inputs.
For learning rate: Adaptive optimizers (Adam, RMSprop) adjust per weight, handling different gradient magnitudes gracefully.
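Of the remedies above, gradient clipping is the simplest to sketch. The function name `clip_by_global_norm` and the threshold are hypothetical; deep learning frameworks ship their own versions, and this is only a minimal illustration of the idea on a flat list of gradients.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)            # already small enough
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], 1.0)    # original norm is 5
untouched = clip_by_global_norm([0.3, 0.4], 1.0)  # norm 0.5, left as-is

print(clipped, untouched)
```

Clipping by norm keeps the direction of the gradient and only shortens the step, which is why it is usually preferred over clamping each component separately.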
Quick check
Why does sigmoid cause vanishing gradients in deep networks?
Open playground
Design your own neural network: choose the number of layers, neurons per layer, and activation function. Feed it inputs, set a target, and watch it learn. Train for 1 step to see gradients, or 100 steps to converge. Experiment with architecture choices and see how they affect learning.
Key takeaway
Backpropagation is the chain rule applied in reverse. You run the network forward, compute the loss, then trace backward, multiplying derivatives as you go. Each weight gets a gradient telling you how much to adjust it and in which direction. This simple idea, repeated over millions of examples, enables machines to learn from data. It's the bedrock of modern AI.