Gradient Descent Zoo — Mastering optimization
Gradient descent is how neural networks learn. But there's not just one way to descend. Explore SGD, momentum, adaptive methods, and Adam. Watch optimizers race. Roll balls down hills. Tune learning rates. By the end, you'll understand not just the math, but the intuition behind why each optimizer works.
Why gradient descent matters
Neural networks train via a simple loop: compute loss, calculate gradients, update weights. The algorithm that does the update is called an optimizer. The simplest optimizer is vanilla stochastic gradient descent (SGD): move opposite the gradient, scaled by a learning rate.
But the loss landscape is rarely smooth. Some directions have steep cliffs. Others are flat plateaus. Some dimensions need bigger steps, others need smaller ones. This is where clever optimizers shine. They adapt to the geometry of the problem.
The training loop
Forward pass → Compute loss → Backward pass (gradients) → Optimizer step (update weights). Repeat thousands of times. The optimizer is the engine of learning.
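To make the loop concrete, here is a minimal sketch in plain Python. The toy dataset (y = 3x + 1), the one-feature linear model, the batch size, and the learning rate are all assumptions chosen for illustration. They are not values from this lesson.

```python
import random

random.seed(0)

# Hypothetical toy data (an assumption for illustration): y = 3x + 1, no noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0        # the "weights" being learned
learning_rate = 0.1    # illustrative value, not a recommendation
batch_size = 16
losses = []

for step in range(500):
    # Draw a mini-batch of example indices.
    batch = random.sample(range(len(xs)), batch_size)

    # Forward pass: prediction errors on the batch.
    errors = [(w * xs[i] + b) - ys[i] for i in batch]

    # Compute loss: mean squared error over the batch.
    loss = sum(e * e for e in errors) / batch_size
    losses.append(loss)

    # Backward pass: gradients of the loss with respect to w and b.
    grad_w = sum(2.0 * e * xs[i] for e, i in zip(errors, batch)) / batch_size
    grad_b = sum(2.0 * e for e in errors) / batch_size

    # Optimizer step: vanilla SGD moves opposite the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}  b={b:.3f}  last batch loss={losses[-1]:.2e}")
```

Swapping in a different optimizer would only change the last two lines of the loop. The forward pass, loss, and backward pass stay the same.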
Optimizer Race: Which path wins?
| Optimizer | Final Loss | Steps to convergence |
|---|---|---|
Watch how different optimizers navigate the same loss landscape. Some take smooth paths, others adapt step size. Which reaches the minimum fastest?
Quick check
What is the role of the optimizer in training?
Quick check
Why don't all optimizers take the same path down a loss surface?
Gradient descent is an iterative optimization method. Starting with weights θ, we repeatedly apply the update:
θ_{t+1} = θ_t - α * ∇L(θ_t)
where α is the learning rate and ∇L is the gradient of the loss. The gradient points in the direction of steepest ascent; we move opposite it. For vectorized parameters, this applies element-wise to each weight. In practice, we compute gradients on mini-batches (stochastic gradient descent) rather than the full dataset (batch gradient descent) for efficiency and better generalization.
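The update rule can be sketched directly in code. The two-parameter toy loss and the names `grad_loss` and `gradient_descent` are hypothetical, picked only so the example has a known minimum at θ = (1, -2).

```python
def grad_loss(theta):
    """Gradient of the toy loss L(θ) = (θ0 - 1)^2 + 10 * (θ1 + 2)^2."""
    return [2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)]


def gradient_descent(theta, alpha, steps):
    """Apply θ_{t+1} = θ_t - α * ∇L(θ_t) for a fixed number of steps."""
    for _ in range(steps):
        grad = grad_loss(theta)
        # Element-wise update: every weight moves opposite its own gradient.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


theta_final = gradient_descent([0.0, 0.0], alpha=0.05, steps=200)
print(theta_final)  # should be close to the minimum at (1, -2)
```

With this loss, a much larger α (for example 0.11) makes the second coordinate overshoot further on every step. That is the divergence the learning rate is meant to prevent.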
Momentum: Accelerate with history
Vanilla SGD can be slow on plateaus and oscillate on steep slopes. Momentum fixes this by remembering past gradients. Instead of just following the current gradient, we accumulate velocity and build momentum toward the minimum.
Think of a ball rolling downhill. It doesn't accelerate and decelerate with every tiny dip in the terrain. Instead, it builds speed and coasts through flat areas. Momentum does the same for optimization.
How momentum works
Keep a velocity vector. Each step, update the velocity: v = γ*v - α*∇L. Then update the weights: θ = θ + v. A typical momentum coefficient of γ = 0.9 keeps 90% of the old velocity at each step. This smooths out noise and accelerates learning.
Momentum: Rolling balls down a hill
Interactive demo: two balls start at position 20.0. The SGD ball follows the gradient only. The Momentum ball also carries a velocity, which starts at 0.000.
With momentum, the ball accumulates velocity and coasts through shallow areas, so it usually converges faster. Higher momentum means more inertia: faster descent, but a greater risk of overshooting the minimum.
Quick check
What is the effect of increasing the momentum coefficient from 0.5 to 0.99?
Quick check
Nesterov momentum evaluates the gradient where?
The momentum update rule maintains a velocity vector v and applies it each step:
v_t = γ * v_{t-1} - α * ∇L(θ_t)
θ_{t+1} = θ_t + v_t
The velocity is an exponentially decaying sum of past gradient steps, which behaves much like a moving average of gradients. When gradients consistently point in one direction, velocity accumulates and steps get bigger. When gradients change direction (like at the edges of a valley), successive contributions cancel, so velocity dampens and oscillations reduce.
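A small comparison shows the effect. This is a sketch under assumptions: the narrow-valley toy loss, the starting point, and the settings α = 0.02 and γ = 0.9 are illustrative choices. All function names are hypothetical.

```python
def loss(theta):
    """Toy narrow valley: L(θ) = 0.5 * (θ0^2 + 50 * θ1^2)."""
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def grad_loss(theta):
    return [theta[0], 50.0 * theta[1]]


def sgd(theta, alpha, steps):
    for _ in range(steps):
        grad = grad_loss(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


def sgd_momentum(theta, alpha, gamma, steps):
    velocity = [0.0 for _ in theta]
    for _ in range(steps):
        grad = grad_loss(theta)
        # v_t = γ * v_{t-1} - α * ∇L(θ_t)
        velocity = [gamma * v - alpha * g for v, g in zip(velocity, grad)]
        # θ_{t+1} = θ_t + v_t
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta


start = [10.0, 1.0]
plain = sgd(start, alpha=0.02, steps=100)
with_momentum = sgd_momentum(start, alpha=0.02, gamma=0.9, steps=100)
print("SGD loss:     ", loss(plain))
print("Momentum loss:", loss(with_momentum))
```

On this surface plain SGD crawls along the flat θ0 direction, while the momentum run builds up velocity there and ends at a lower loss after the same number of steps.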
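Nesterov momentum modifies this by evaluating the gradient at the 'lookahead' position, where the carried-over velocity is about to take the weights: ∇L(θ_t + γ*v_{t-1}) instead of ∇L(θ_t). This often converges faster because the update responds to the geometry ahead rather than the geometry at the current point.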
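The lookahead step can be sketched as follows. The code evaluates the gradient at θ + γ*v, where v is the velocity carried over from the previous step, and then applies the usual momentum update. The toy loss and the hyperparameters are assumptions for illustration.

```python
def loss(theta):
    """Toy narrow valley: L(θ) = 0.5 * (θ0^2 + 50 * θ1^2)."""
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def grad_loss(theta):
    return [theta[0], 50.0 * theta[1]]


def nesterov(theta, alpha, gamma, steps):
    velocity = [0.0 for _ in theta]
    for _ in range(steps):
        # Peek ahead to where the current velocity would carry the weights.
        lookahead = [t + gamma * v for t, v in zip(theta, velocity)]
        # Gradient at the lookahead point, not at theta itself.
        grad = grad_loss(lookahead)
        velocity = [gamma * v - alpha * g for v, g in zip(velocity, grad)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta


result = nesterov([10.0, 1.0], alpha=0.02, gamma=0.9, steps=200)
print(result, loss(result))
```

The only difference from classical momentum is the point at which `grad_loss` is called.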
Adaptive learning rates: Let each dimension learn its own pace
AdaGrad
Accumulates squared gradients. Parameters with large gradients get smaller step sizes.
Problem: Learning rate monotonically decreases. Can die out in long training.
RMSProp
Use exponential moving average of squared gradients instead of cumulative. Fixes AdaGrad's decay problem.
More stable learning rate. Doesn't die out. Widely used in practice.
Intuition
Different parameters have different gradient scales. Adaptive methods track this per parameter and apply smaller steps to dimensions with large or spiky gradients, and larger steps to dimensions with small, steady gradients.
Quick check
Why does AdaGrad's learning rate monotonically decrease?
AdaGrad:
g_t = g_{t-1} + (∇L)²
θ_{t+1} = θ_t - α * ∇L / (√g_t + ε)
where g_t accumulates squared gradients over all time. The denominator only grows, so step size shrinks.
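A minimal sketch of AdaGrad makes the shrinking step size visible. The toy loss, the starting point, and α = 0.5 are assumptions for illustration. The function also records the effective learning rate α / (√g_t + ε) of the first parameter so it can be inspected.

```python
import math


def loss(theta):
    """Toy narrow valley: L(θ) = 0.5 * (θ0^2 + 50 * θ1^2)."""
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def grad_loss(theta):
    return [theta[0], 50.0 * theta[1]]


def adagrad(theta, alpha, steps, eps=1e-8):
    accum = [0.0 for _ in theta]   # g_t: running sum of squared gradients
    effective_rates = []           # α / (√g_t + ε) for parameter 0
    for _ in range(steps):
        grad = grad_loss(theta)
        # g_t = g_{t-1} + (∇L)^2, per parameter; this sum never shrinks.
        accum = [a + g * g for a, g in zip(accum, grad)]
        theta = [t - alpha * g / (math.sqrt(a) + eps)
                 for t, g, a in zip(theta, grad, accum)]
        effective_rates.append(alpha / (math.sqrt(accum[0]) + eps))
    return theta, effective_rates


final, rates = adagrad([10.0, 1.0], alpha=0.5, steps=200)
print("first rate:", rates[0], "last rate:", rates[-1])
```

Because squared gradients are only ever added to `accum`, the recorded rate can never increase from one step to the next.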
RMSProp:
g_t = β * g_{t-1} + (1-β) * (∇L)²
θ_{t+1} = θ_t - α * ∇L / (√g_t + ε)
RMSProp uses a decay rate β (typically 0.9 or 0.99). Recent squared gradients matter more than old ones, preventing the denominator from growing unboundedly.
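RMSProp changes only the accumulator line, as this sketch shows. The toy loss and the settings α = 0.05 and β = 0.9 are illustrative assumptions.

```python
import math


def loss(theta):
    """Toy narrow valley: L(θ) = 0.5 * (θ0^2 + 50 * θ1^2)."""
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def grad_loss(theta):
    return [theta[0], 50.0 * theta[1]]


def rmsprop(theta, alpha, beta, steps, eps=1e-8):
    avg_sq = [0.0 for _ in theta]  # g_t: moving average of squared gradients
    for _ in range(steps):
        grad = grad_loss(theta)
        # g_t = β * g_{t-1} + (1 - β) * (∇L)^2: old gradients fade out.
        avg_sq = [beta * a + (1.0 - beta) * g * g
                  for a, g in zip(avg_sq, grad)]
        theta = [t - alpha * g / (math.sqrt(a) + eps)
                 for t, g, a in zip(theta, grad, avg_sq)]
    return theta


start = [10.0, 1.0]
final = rmsprop(start, alpha=0.05, beta=0.9, steps=500)
print(final, loss(final))
```

With a constant α, RMSProp tends to hover near the minimum rather than settle exactly on it, which is one reason it is often paired with a learning rate schedule.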
Adam: The industry standard
Adam (Adaptive Moment Estimation) combines the best of momentum and adaptive rates. It maintains two things:
- 1st moment: Exponential average of gradients (like momentum)
- 2nd moment: Exponential average of squared gradients (like RMSProp)
The result? Adam adapts per-dimension learning rates like RMSProp while maintaining momentum for faster convergence. It typically works well out of the box with default parameters. Most modern deep learning uses Adam.
Default hyperparameters
Learning rate: 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸. These defaults work well for most tasks. β1, β2, and ε rarely need tuning; the learning rate is the one most worth adjusting.
The 6-step journey through optimizers
Follow each optimizer concept step by step.
Step 1: Vanilla SGD: Follow the slope
The simplest optimizer: just move in the opposite direction of the gradient.
⚡ Key insight
SGD is the foundation. All fancier optimizers build on this idea, but make it smarter.
Optimization path on Rosenbrock function
Quick check
Why does Adam typically work better than SGD without tuning?
Adam maintains two exponential moving averages:
m_t = β1 * m_{t-1} + (1-β1) * ∇L(θ_t)
v_t = β2 * v_{t-1} + (1-β2) * (∇L(θ_t))²
These are biased toward zero at the start (especially with small t), so we apply bias correction:
m_hat_t = m_t / (1 - β1^t)
v_hat_t = v_t / (1 - β2^t)
Finally, the update: θ_{t+1} = θ_t - α * m_hat_t / (√v_hat_t + ε). The bias correction is crucial early in training; after many steps it becomes negligible.
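The equations above can be sketched in a few lines. The function signature uses the default hyperparameters listed earlier. The toy loss and the larger α = 0.1 used in the calls are assumptions, chosen so the effect is easy to see.

```python
import math


def loss(theta):
    """Toy narrow valley: L(θ) = 0.5 * (θ0^2 + 50 * θ1^2)."""
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)


def grad_loss(theta):
    return [theta[0], 50.0 * theta[1]]


def adam(theta, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0 for _ in theta]  # 1st moment: average of gradients
    v = [0.0 for _ in theta]  # 2nd moment: average of squared gradients
    for t in range(1, steps + 1):
        grad = grad_loss(theta)
        m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
        v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
        # Bias correction: undo the pull toward zero from the zero init.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - alpha * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


start = [10.0, 1.0]
one_step = adam(start, steps=1, alpha=0.1)
many_steps = adam(start, steps=2000, alpha=0.1)
print("after 1 step:    ", one_step)
print("after 2000 steps:", many_steps, loss(many_steps))
```

After the first step both parameters have moved by almost exactly α, even though their gradients differ by a factor of five. With bias correction, m_hat equals the gradient and √v_hat equals its magnitude at t = 1, so the very first step has size about α in every dimension.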
Learning rate schedules: Control the pace
Even the best optimizer needs the right learning rate at the right time. A constant learning rate often doesn't work: you need a large rate early to learn fast, but a small rate later to fine-tune without oscillating around the optimum.
Learning rate schedulers solve this by adjusting the learning rate during training. Common strategies include step decay (drop by a factor every N steps), cosine annealing (smooth decrease), and warmup (increase slowly at first).
When to use each schedule
- Step decay: Simple, effective, easy to control.
- Cosine annealing: Smooth, no hard drops. Popular in modern training.
- Warmup: Essential for transformers and other large models. Prevents instability.
Learning rate schedulers
Step decay
Reduces LR by factor every N steps. Good for stable training.
Cosine annealing
Smooth decrease following cosine curve. Popular in modern training.
Warmup + Linear
Gradually increase then linearly decrease. Prevents instability at start.
Learning rate schedulers control how the learning rate changes during training. Start high to learn fast, then decay to fine-tune. Different schedules suit different problems.
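Step decay and warmup with linear decay can each be written as a function from the step number to a learning rate. This is a sketch: the function names, parameter names, and the convention that warmup starts from zero are assumptions, and libraries differ in such details.

```python
def step_decay(step, base_lr, drop_factor, drop_every):
    """Multiply the learning rate by drop_factor once every drop_every steps."""
    return base_lr * drop_factor ** (step // drop_every)


def warmup_linear(step, peak_lr, warmup_steps, total_steps):
    """Ramp linearly from 0 to peak_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print a few values from each schedule.
for step in (0, 5, 10, 25, 55, 100):
    print(step,
          step_decay(step, base_lr=0.1, drop_factor=0.5, drop_every=10),
          warmup_linear(step, peak_lr=1.0, warmup_steps=10, total_steps=100))
```

In a training loop, the returned value would replace the fixed learning rate in the optimizer step for that iteration.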
Quick check
Why is a warmup schedule often used with large models like transformers?
Cosine annealing is a popular schedule that decreases the learning rate following a cosine curve:
α_t = α_min + (α_max - α_min) * (1 + cos(π * t / T)) / 2
where t is the current step, T is the total number of steps, α_max is the initial learning rate, and α_min is the minimum (often 0 or a small fraction of α_max, such as 1%).
The cosine shape is smooth and has no hard discontinuities like step decay. It has become a common default in modern deep learning, especially for transformers and large-scale training. Some practitioners use multiple restarts (cosine annealing with warm restarts) to help escape local minima.
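The cosine formula translates directly into code. The function and parameter names here are hypothetical, and the example values (α_max = 0.001, α_min = 0.00001, T = 1000) are assumptions for illustration.

```python
import math


def cosine_annealing(t, total_steps, lr_max, lr_min=0.0):
    """α_t = α_min + (α_max - α_min) * (1 + cos(π * t / T)) / 2"""
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + (lr_max - lr_min) * (1.0 + cosine) / 2.0


# Sample the schedule at a few points between the first and last step.
schedule = [cosine_annealing(t, total_steps=1000, lr_max=1e-3, lr_min=1e-5)
            for t in (0, 250, 500, 750, 1000)]
print(schedule)
```

The schedule starts at α_max, passes through the midpoint of the two rates halfway through training, and ends at α_min.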
Playground: Experiment yourself
Interactive optimizer playground
Experiment with different optimizers and learning rates. Watch how step size affects convergence. Some optimizers handle difficult landscapes better than others!
Quick check
On the Ackley function, which optimizer typically converges fastest?
Quick check
If you increase the learning rate too much, what happens?
- Learning rate: Start with 0.001 for Adam. If the loss decreases too slowly, try a larger rate such as 0.01. If the loss diverges, try 0.0001. Too high diverges; too low trains forever.
- Batch size: Larger batches give more stable gradient estimates but fewer updates per epoch. 32–256 is typical.
- Scheduler: Start with constant learning rate. Add cosine annealing if loss plateaus late in training.
- Warmup: Most important for very large models (> 100M params). Usually safe to skip for small networks.
- Momentum (for SGD): 0.9 is standard. Rarely needs tuning.
You now understand the gradient descent zoo
✓ SGD is the foundation. Just follow the gradient, scaled by learning rate.
✓ Momentum accelerates by accumulating velocity. Great for plateaus.
✓ Adaptive methods (AdaGrad, RMSProp) scale per-dimension. Handle spiky gradients.
✓ Adam combines momentum + adaptive scaling. The industry standard.
✓ Learning rate schedulers control the pace. Start high, decay low.
Next, explore how optimizers interact with batch normalization, regularization, and loss functions. Each piece of the training pipeline affects the others.