Vamsi Krishna Sankarayogi — Technologist at Heart

Why backpropagation matters

You have a neural network with millions of weights. You feed it input, it makes a prediction, and you measure the error (loss). Now what? You can't just try random changes — the space of possibilities is infinite.

Backpropagation solves this. It computes, for every weight, exactly how much that weight contributed to the error, and in which direction. This gradient tells you: increase this weight or decrease it? By how much? It transforms an impossible optimization problem into a solvable one.

The learning problem

Given weights W, input X, target Y, and loss function L:
Find: dL/dW (gradient of loss with respect to each weight)
Update: W ← W − α·(dL/dW) where α is learning rate

Forward Pass

Forward flow

Backward flow

Before the 1980s, neural networks were a dead end. Researchers could build networks but couldn't train them. David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-Propagating Errors" in 1986, reinvigorating the field.

Backprop wasn't entirely new — it appeared in earlier work by Linnainmaa (1970) and others — but Rumelhart, Hinton, and Williams showed how powerful it was when applied to multi-layer networks. It became the foundation of deep learning.

Today, every serious neural network training uses backprop (or a variant like Adam or RMSprop that modifies the gradients). Without it, deep learning as we know it wouldn't exist.

?

Quick check

Why can't we just try random weight changes to reduce loss?

The chain rule — calculus you already know

Backpropagation is just the chain rule applied to neural networks. You learned this in calculus: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). In a neural network, you have long chains: output = f(layer3(layer2(layer1(input)))). To find dLoss/dWeight, you multiply all these derivatives together.

Interactive chain rule explorer

Inner function g(x)

Outer function f(u)

Input x

2.00

Composition: y = f(g(x))

Input

x = 2.00

f(u)

Hidden

u = 6.00

f(u)

Output

y = 36.00

Chain rule: dy/dx = (dy/du) · (du/dx)

du/dx

Derivative of g

3.000

×

dy/du

Derivative of f

12.000

dy/dx = du/dx × dy/du

3.000 × 12.000 = 36.000

The chain rule tells us: to find the slope at x, multiply the slopes of the intermediate steps. Each inner arrow shows one derivative, and we multiply them together to get dy/dx.

In a neural network, every layer takes multiple inputs. The chain rule still applies — you just use partial derivatives. If z = f(x, y), then:

dz/dx = (∂f/∂x) + (∂f/∂y)(dy/dx)

Every weight affects the loss through multiple paths (one for each neuron it feeds into). Backprop accumulates all these contributions — that's what makes it so powerful.

?

Quick check

If you have dy/dz = 2 and dz/dx = 3, what is dy/dx?

Forward pass — inputs to loss

Before you can compute gradients, you need to run the network forward. Each layer: (1) multiplies inputs by weights, (2) adds biases, (3) applies activation function (sigmoid, ReLU, etc.). At the end, you compare the output to the target and compute loss.

Forward pass anatomy

# Layer 1: inputs → hidden

z1 = input · W1 + b1

a1 = sigmoid(z1)

# Layer 2: hidden → output

z2 = a1 · W2 + b2

output = z2 (linear output)

# Loss

loss = 0.5 × (output − target)²

Sigmoid squashes outputs to (0, 1). It's interpretable and smooth, but its derivative is always ≤ 0.25, which causes vanishing gradients in deep networks.

ReLU (max(0, x)) is simple and keeps large gradients flowing. But it's piecewise linear, so "dead" ReLU neurons (with negative inputs) never recover.

Modern networks often use variants: Leaky ReLU (allows small negative slopes), GELU, or SiLU. The choice affects both forward pass speed and backward pass stability.

?

Quick check

In the forward pass, what does the sigmoid activation do?

Backward pass — gradients flow home

This is the heart of backpropagation. Starting at the loss, we compute how much each weight contributed to the error. Step through the guided lesson below: watch the gradients propagate backward, see how each weight gets blamed or praised, and update weights to reduce loss.

A tiny 2-layer network

We start with a small neural network: 2 inputs, 2 hidden neurons with sigmoid activation, 1 output. Random weights initialized. Our job: predict a target value given input. The network is way off right now, so we need to learn.

Learning rate: 0.30

Target value: 0.70

Input 1

0.50

Input 2

0.30

Prediction

0.5025

Loss

0.0195

Network values

Input

0.50

0.30

Layer 1

0.51

0.50

Layer 2

0.50

?

Quick check

What does a large gradient on a weight mean?

The learning rate (α) controls step size. With the gradient pointing downhill, we move a little bit in that direction: NewWeight = OldWeight − α × Gradient.

Too high: we overshoot and miss the minimum (or diverge entirely). Too low: we crawl forward so slowly that training takes forever.

Typical values are 0.01 to 0.001 for SGD. Modern optimizers like Adam adapt the learning rate per weight, making training more robust to this choice.

What can go wrong: vanishing and exploding gradients

In very deep networks, gradients can shrink to zero (vanishing) or explode to infinity. Both break learning. Explore how network architecture and activation functions affect gradient flow.

Gradient magnitude flow (Sigmoid)

Layer 1

0.312

OK

Layer 2

0.033

Warn

Layer 3

0.019

Warn

Layer 4

0.004

Bad

Layer 5

0.001

Bad

The vanishing gradient problem

Sigmoid derivatives are always ≤ 0.25. Multiplying many of these small values together causes gradients to shrink exponentially, making learning in deep networks very slow.

Healthy (0.1 - 0.5)

Warning (<0.1 or >0.5)

Critical (<0.01 or >1.0)

Gradient magnitude flow (ReLU)

Layer 1

0.196

OK

Layer 2

0.419

OK

Layer 3

0.389

OK

Layer 4

0.365

OK

Layer 5

0.376

OK

The exploding gradient problem

Without proper weight initialization or normalization, ReLU can cause gradients to grow exponentially through layers, leading to instability and NaN values.

Healthy (0.1 - 0.5)

Warning (<0.1 or >0.5)

Critical (<0.01 or >1.0)

📉

Vanishing Gradients

Each layer multiplies by a small number. After many layers, the product approaches zero. Early layers never learn.

📈

Exploding Gradients

Each layer multiplies by a number larger than 1. The product grows exponentially. Weights become NaN.

💀

Dead Neurons

With ReLU, if a neuron's weighted sum is negative, its output is 0 and its gradient is 0. It never recovers.

⚡

Learning Rate

Too high and you overshoot. Too low and you move at a snail's pace. Adaptive optimizers help tune this automatically.

For vanishing gradients: Residual connections (skip connections) let gradients bypass layers. Batch normalization stabilizes activation ranges. LSTM and GRU gates explicitly control gradient flow.

For exploding gradients: Gradient clipping caps the norm. Weight regularization (L2) keeps weights small. Careful weight initialization (Xavier, He initialization) starts with a good scale.

For dead ReLU: Leaky ReLU allows small negative slopes. ELU and SELU have other tricks.For learning rate: Adaptive optimizers (Adam, RMSprop) adjust per weight, handling different gradient magnitudes gracefully.

?

Quick check

Why does sigmoid cause vanishing gradients in deep networks?

Open playground

Design your own neural network: choose the number of layers, neurons per layer, and activation function. Feed it inputs, set a target, and watch it learn. Train for 1 step to see gradients, or 100 steps to converge. Experiment with architecture choices and see how they affect learning.

Key takeaway

Backpropagation is the chain rule in reverse. You run the network forward, compute loss, then trace backward multiplying derivatives. Each weight gets a gradient telling you how much to adjust it. This simple idea — repeat for millions of examples — enables machines to learn from data. It's the bedrock of modern AI.

Backpropagation — the algorithm that learns

Why backpropagation matters

The chain rule — calculus you already know

Forward pass — inputs to loss

Backward pass — gradients flow home

A tiny 2-layer network

What can go wrong: vanishing and exploding gradients

Open playground

Finished this lesson?