Interactive · ~18 min · Intermediate

Gradient Descent Zoo — Mastering optimization

Gradient descent is how neural networks learn. But there's not just one way to descend. Explore SGD, momentum, adaptive methods, and Adam. Watch optimizers race. Roll balls down hills. Tune learning rates. By the end, you'll understand not just the math, but the intuition behind why each optimizer works.

Why gradient descent matters

Neural networks train via a simple loop: compute loss, calculate gradients, update weights. The algorithm that does the update is called an optimizer. The simplest optimizer is vanilla stochastic gradient descent (SGD): move opposite the gradient, scaled by a learning rate.

But the loss landscape is rarely smooth. Some directions have steep cliffs. Others are flat plateaus. Some dimensions need bigger steps, others need smaller ones. This is where clever optimizers shine. They adapt to the geometry of the problem.

The training loop

Forward pass → Compute loss → Backward pass (gradients) → Optimizer step (update weights). Repeat thousands of times. The optimizer is the engine of learning.
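The four stages of the loop can be sketched in plain Python for a tiny linear model. This is a minimal illustration, not a framework implementation: the function name `train_linear`, the toy data, and the learning rate are all assumptions chosen for the example.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b with full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss = None
    for _ in range(steps):
        # Forward pass: predictions from the current weights.
        preds = [w * x + b for x in xs]
        # Compute loss: mean squared error (tracked for monitoring only).
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Optimizer step: vanilla gradient descent update.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1 (hypothetical example values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, loss = train_linear(xs, ys)
```

Swapping in a different optimizer only changes the last two lines of the loop body; the forward pass, loss, and backward pass stay the same.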

Optimizer Race: Which path wins?

[Results table columns: Optimizer, Final Loss, Steps to convergence]

Watch how different optimizers navigate the same loss landscape. Some take smooth paths, others adapt step size. Which reaches the minimum fastest?

Quick check

What is the role of the optimizer in training?

Quick check

Why don't all optimizers take the same path down a loss surface?

Gradient descent is iterative optimization. Starting with weights θ, we repeatedly:

θ_{t+1} = θ_t - α * ∇L(θ_t)

where α is the learning rate and ∇L is the gradient of the loss. The gradient points in the direction of steepest ascent; we move opposite it. For vectorized parameters, this applies element-wise to each weight. In practice, we compute gradients on mini-batches (stochastic gradient descent) rather than the full dataset (batch gradient descent) for efficiency and, often, better generalization.
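To see the update rule in action, here is a minimal sketch on the one-dimensional loss L(θ) = θ², whose gradient is 2θ. The loss function, the starting point, and the two learning rates are assumptions picked for illustration.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Apply theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


def grad_quadratic(theta):
    # Gradient of the toy loss L(theta) = theta**2.
    return 2.0 * theta


# With lr=0.1 each step multiplies theta by 0.8, so it shrinks toward 0.
small = gradient_descent(grad_quadratic, theta0=5.0, lr=0.1, steps=50)
# With lr=1.1 each step multiplies theta by -1.2, so it flips sign and grows.
large = gradient_descent(grad_quadratic, theta0=5.0, lr=1.1, steps=50)
```

On this loss any learning rate above 1.0 diverges, because the multiplier 1 - 2α then has magnitude greater than 1.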

Momentum: Accelerate with history

Vanilla SGD can be slow on plateaus and oscillate on steep slopes. Momentum fixes this by remembering past gradients. Instead of just following the current gradient, we accumulate velocity and build momentum toward the minimum.

Think of a ball rolling downhill. It doesn't accelerate and decelerate with every tiny dip in the terrain. Instead, it builds speed and coasts through flat areas. Momentum does the same for optimization.

How momentum works

Keep a velocity vector. Each step, update velocity: v = γ*v - α*∇L. Then update weights: θ = θ + v. The typical momentum coefficient γ = 0.9 means keep 90% of old velocity. This smooths out noise and accelerates learning.

Momentum: Rolling balls down a hill

[Interactive demo: an SGD ball (follows the gradient only) and a Momentum ball (position and velocity shown) both start at position 20.0 and roll toward the minimum; a slider sets the momentum from none to high.]

With momentum, the ball accumulates velocity and coasts through shallow areas, so it often converges faster. Higher momentum means more inertia: faster descent, but a greater risk of overshooting the minimum.

Quick check

What is the effect of increasing the momentum coefficient from 0.5 to 0.99?

Quick check

Nesterov momentum evaluates the gradient where?

The momentum update rule maintains a velocity vector v and applies it each step:

v_t = γ * v_{t-1} - α * ∇L(θ_t)
θ_{t+1} = θ_t + v_t

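Unrolling the recursion shows that the velocity is an exponentially decaying sum of past gradient steps, so it behaves like a moving average of gradients. When gradients consistently point in one direction, velocity accumulates and steps get bigger. When gradients change direction (like at the edges of a valley), opposing contributions cancel, so velocity dampens and oscillations shrink.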

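Nesterov momentum modifies this by evaluating the gradient at the 'lookahead' position: ∇L(θ_t + γ*v_{t-1}) instead of ∇L(θ_t). Because the gradient is measured where the accumulated velocity is about to carry the weights, the update can correct course before overshooting, which often speeds up convergence.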
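The three update rules (plain SGD, classical momentum, Nesterov momentum) can be compared side by side in a short sketch. The toy loss, a narrow valley that is steep in y and shallow in x, the learning rate of 0.03, and the helper names `grad`, `loss`, and `run` are assumptions for illustration; γ = 0.9 follows the text.

```python
def grad(theta):
    # Gradient of the toy loss L(x, y) = 0.5 * (x**2 + 25 * y**2).
    x, y = theta
    return (x, 25.0 * y)


def loss(theta):
    x, y = theta
    return 0.5 * (x ** 2 + 25.0 * y ** 2)


def run(method, steps=100, lr=0.03, gamma=0.9):
    theta = (10.0, 1.0)
    v = (0.0, 0.0)  # velocity carried over from the previous step
    for _ in range(steps):
        if method == "sgd":
            # No memory: the step is just the scaled negative gradient.
            g = grad(theta)
            v = tuple(-lr * gi for gi in g)
        elif method == "momentum":
            # Classical momentum: gradient evaluated at the current weights.
            g = grad(theta)
            v = tuple(gamma * vi - lr * gi for vi, gi in zip(v, g))
        elif method == "nesterov":
            # Nesterov: gradient evaluated at the lookahead position,
            # one velocity step ahead of the current weights.
            lookahead = tuple(ti + gamma * vi for ti, vi in zip(theta, v))
            g = grad(lookahead)
            v = tuple(gamma * vi - lr * gi for vi, gi in zip(v, g))
        else:
            raise ValueError(method)
        theta = tuple(ti + vi for ti, vi in zip(theta, v))
    return theta


final_losses = {name: loss(run(name)) for name in ("sgd", "momentum", "nesterov")}
```

In this particular setup both momentum variants finish with a lower loss than plain SGD, because SGD crawls along the shallow x direction while the velocity term keeps building speed there.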

Adaptive learning rates: Let each dimension learn its own pace

AdaGrad

Accumulates squared gradients. Parameters with large gradients get smaller step sizes.

Problem: The effective learning rate only ever decreases, so updates can shrink to almost nothing during long training runs.

RMSProp

Uses an exponential moving average of squared gradients instead of a cumulative sum. Fixes AdaGrad's decay problem.

More stable learning rate. Doesn't die out. Widely used in practice.

Intuition

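Different parameters have different gradient scales. Adaptive methods divide each parameter's step by a running measure of its gradient magnitude, so they take smaller steps along dimensions with large or spiky gradients and larger steps along dimensions with small, steady gradients.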

Quick check

Why does AdaGrad's learning rate monotonically decrease?

AdaGrad:

g_t = g_{t-1} + (∇L)²
θ_{t+1} = θ_t - α * ∇L / (√g_t + ε)

where g_t accumulates squared gradients over all time. The denominator only grows, so step size shrinks.
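A minimal single-parameter sketch of the AdaGrad rule makes the shrinking step size visible. The function name `adagrad`, the toy gradient 2θ (from the loss θ²), and the hyperparameter values are assumptions for illustration.

```python
import math


def adagrad(grad, theta0, lr=0.5, eps=1e-8, steps=100):
    """AdaGrad for one scalar parameter; also records the effective step size."""
    theta = theta0
    g_acc = 0.0  # running sum of squared gradients (g_t in the text)
    effective_lrs = []
    for _ in range(steps):
        g = grad(theta)
        g_acc += g * g  # the sum never decreases
        step_scale = lr / (math.sqrt(g_acc) + eps)
        effective_lrs.append(step_scale)
        theta -= step_scale * g
    return theta, effective_lrs


theta, effective_lrs = adagrad(lambda th: 2.0 * th, theta0=5.0)
```

Because `g_acc` can only grow, every entry in `effective_lrs` is less than or equal to the one before it, which is the monotone decay described above.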

RMSProp:

g_t = β * g_{t-1} + (1-β) * (∇L)²
θ_{t+1} = θ_t - α * ∇L / (√g_t + ε)

RMSProp uses a decay rate β (typically 0.9 or 0.99). Recent squared gradients matter more than old ones, preventing the denominator from growing unboundedly.
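The RMSProp rule differs from AdaGrad in a single line: the accumulator becomes an exponential moving average. The sketch below is a single-parameter illustration; the function name `rmsprop`, the toy gradient 2θ, and the learning rate are assumptions.

```python
import math


def rmsprop(grad, theta0, lr=0.01, beta=0.9, eps=1e-8, steps=500):
    """RMSProp for one scalar parameter."""
    theta = theta0
    g_avg = 0.0  # exponential moving average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        # Old squared gradients fade out at rate beta, so g_avg can shrink again.
        g_avg = beta * g_avg + (1.0 - beta) * g * g
        theta -= lr * g / (math.sqrt(g_avg) + eps)
    return theta


theta = rmsprop(lambda th: 2.0 * th, theta0=1.0)
```

Since the step is the gradient divided by its own recent magnitude, each update has a size on the order of the learning rate, so the parameter ends up hovering close to the minimum at 0 rather than stalling far from it.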

Adam: The industry standard

Adam (Adaptive Moment Estimation) combines the best of momentum and adaptive rates. It maintains two things:

  • 1st moment: Exponential average of gradients (like momentum)
  • 2nd moment: Exponential average of squared gradients (like RMSProp)

The result? Adam adapts per-dimension learning rates like RMSProp while maintaining momentum for faster convergence. It typically works well out of the box with default parameters, which has made it a common default choice in modern deep learning.

Default hyperparameters

Learning rate: 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸. These defaults work well for most tasks. β1, β2, and ε rarely need tuning; the learning rate is the one most worth adjusting.

The 6-step journey through optimizers

Follow each optimizer concept step by step. Expand to dive deep.

Step 1: Vanilla SGD: Follow the slope

The simplest optimizer: just move in the opposite direction of the gradient.

Update rule: θ_{t+1} = θ_t - α * ∇L(θ_t)
α (learning rate) controls step size.
Fast to compute but can oscillate on curved surfaces.
No memory of past gradients.

⚡ Key insight

SGD is the foundation. All fancier optimizers build on this idea, but make it smarter.

Optimization path on Rosenbrock function

Quick check

Why does Adam typically work better than SGD without tuning?

Adam maintains two exponential moving averages:

m_t = β1 * m_{t-1} + (1-β1) * ∇L(θ_t)
v_t = β2 * v_{t-1} + (1-β2) * (∇L(θ_t))²

Both averages start at zero (m_0 = v_0 = 0), so they are biased toward zero during the first steps (small t). We apply bias correction:

m_hat_t = m_t / (1 - β1^t)
v_hat_t = v_t / (1 - β2^t)

Finally, the update: θ_{t+1} = θ_t - α * m_hat_t / (√v_hat_t + ε). The bias correction is crucial early in training; after many steps β1^t and β2^t approach zero and it becomes negligible.
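The equations above translate almost line for line into code. This is a minimal single-parameter sketch, not a library implementation; the function name `adam` and the toy gradient 2θ are assumptions, while the default hyperparameters are the ones listed in this lesson.

```python
import math


def adam(grad, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam for one scalar parameter, with bias correction."""
    theta = theta0
    m = 0.0  # 1st moment: moving average of gradients
    v = 0.0  # 2nd moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: undo the pull toward the zero initialization.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad_quadratic(theta):
    # Gradient of the toy loss L(theta) = theta**2.
    return 2.0 * theta


# On the first step m_hat equals the gradient and v_hat equals its square,
# so the step has a magnitude of about lr regardless of the gradient scale.
after_one_step = adam(grad_quadratic, theta0=5.0, steps=1)
after_many_steps = adam(grad_quadratic, theta0=5.0, steps=1000)
```

Without the two correction lines, the very first update would be scaled by (1 - β1) / √(1 - β2) instead of 1, which is why the correction matters most at small t.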

Learning rate schedules: Control the pace

Even the best optimizer needs the right learning rate at the right time. A constant learning rate often doesn't work: you need a large rate early to learn fast, but a small rate later to fine-tune without oscillating around the optimum.

Learning rate schedulers solve this by adjusting the learning rate during training. Common strategies include step decay (drop by a factor every N steps), cosine annealing (smooth decrease), and warmup (increase slowly at first).

When to use each schedule

  • Step decay: Simple, effective, easy to control.
  • Cosine annealing: Smooth, no hard drops. Popular in modern training.
  • Warmup: Essential for transformers and other large models. Prevents instability.
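Two of these schedules, step decay and warmup followed by linear decay, can be written as small functions of the step index. This is a sketch under assumed values: the function names, the base rate of 0.1, the drop factor, and the warmup length are illustrative choices, and `warmup_linear` assumes `total_steps` is larger than `warmup_steps`.

```python
def step_decay(step, base_lr=0.1, drop=0.5, every=100):
    """Multiply the learning rate by `drop` once every `every` steps."""
    return base_lr * drop ** (step // every)


def warmup_linear(step, total_steps, peak_lr=0.1, warmup_steps=100):
    """Ramp linearly from 0 to peak_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))


# Learning rate at a few sample steps of a 1000-step run.
sample_steps = [0, 50, 100, 500, 1000]
step_decay_rates = [step_decay(s) for s in sample_steps]
warmup_rates = [warmup_linear(s, total_steps=1000) for s in sample_steps]
```

Either function can be called once per training step to set the optimizer's learning rate before the update.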

Learning rate schedulers

[Plot: learning rate vs. steps for five schedules: constant, step decay, exponential, cosine annealing, warmup + linear]

Step decay

Reduces LR by factor every N steps. Good for stable training.

Cosine annealing

Smooth decrease following cosine curve. Popular in modern training.

Warmup + Linear

Gradually increase then linearly decrease. Prevents instability at start.

Learning rate schedulers control how the learning rate changes during training. Start high to learn fast, then decay to fine-tune. Different schedules suit different problems.

Quick check

Why is a warmup schedule often used with large models like transformers?

Cosine annealing is a popular schedule that decreases the learning rate following a cosine curve:

α_t = α_min + (α_max - α_min) * (1 + cos(π * t / T)) / 2

where t is the current step, T is the total steps, α_max is the initial learning rate, and α_min is the minimum (often zero or a small fraction of α_max, such as 1%).

The cosine shape is smooth and has no hard discontinuities like step decay. It has become a common choice in modern deep learning, especially for transformers and large-scale training. Some practitioners use multiple restarts (cosine annealing with warm restarts), which can help the optimizer escape poor local minima.
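The cosine formula above is a one-liner in code. In this sketch the function name `cosine_annealing` and the example rates (α_max = 0.001, α_min = 1% of that) are assumptions for illustration.

```python
import math


def cosine_annealing(t, total_steps, lr_max=0.001, lr_min=0.00001):
    """alpha_t = alpha_min + (alpha_max - alpha_min) * (1 + cos(pi * t / T)) / 2"""
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + (lr_max - lr_min) * (1.0 + cosine) / 2.0


total_steps = 1000
start = cosine_annealing(0, total_steps)               # cos(0) = 1, so alpha_max
middle = cosine_annealing(total_steps // 2, total_steps)  # halfway between the two
end = cosine_annealing(total_steps, total_steps)       # cos(pi) = -1, so alpha_min
```

The rate changes slowly near the start and end of training and fastest in the middle, which is what distinguishes this schedule from a straight linear decay.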

Playground: Experiment yourself

Interactive optimizer playground

[Interactive playground: runs up to 200 steps and reports the current position and loss value.]

Experiment with different optimizers and learning rates. Watch how step size affects convergence. Some optimizers handle difficult landscapes better than others!

Quick check

On the Ackley function, which optimizer typically converges fastest?

Quick check

If you increase the learning rate too much, what happens?

  • Learning rate: Start with 0.001 for Adam. If the loss barely moves, try 0.01. If the loss diverges, try 0.0001. Too high diverges; too low trains forever.
  • Batch size: Larger batches = more stable gradients but fewer updates per epoch. 32–256 is typical.
  • Scheduler: Start with constant learning rate. Add cosine annealing if loss plateaus late in training.
  • Warmup: Only needed for very large models (> 100M params). Skip for small networks.
  • Momentum (for SGD): 0.9 is standard. Rarely needs tuning.

You now understand the gradient descent zoo

✓ SGD is the foundation. Just follow the gradient, scaled by learning rate.

✓ Momentum accelerates by accumulating velocity. Great for plateaus.

✓ Adaptive methods (AdaGrad, RMSProp) scale per-dimension. Handle spiky gradients.

✓ Adam combines momentum + adaptive scaling. The industry standard.

✓ Learning rate schedulers control the pace. Start high, decay low.

Next, explore how optimizers interact with batch normalization, regularization, and loss functions. Each piece of the training pipeline affects the others.
