Interactive · ~15 min · Intermediate

Loss functions — measuring how wrong you are

Every neural network learns by minimizing a loss function — a number that quantifies how badly the model's predictions miss the truth. You'll explore MSE, MAE, and cross-entropy; see why some losses penalize outliers and others are robust; and understand why choosing the right loss is as important as choosing the right architecture.

Why loss functions matter

Training a neural network follows a loop: make a prediction, measure the error with a loss function, compute gradients, and update weights. The loss function is the compass that guides learning. Without it, the model has no signal about whether it's getting better or worse.

Different problems need different losses. A loss that works for regression might fail for classification. A loss robust to outliers might be too forgiving for clean data. This section teaches you to recognize which loss suits your problem and why.

The optimization loop

Predict → Loss (measure error) → Gradient (where to go) → Update (adjust weights). Repeat. The loss function is the bridge between predictions and learning.
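The four steps of the loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own implementation: the toy data (generated from y = 3x), the one-weight model, and the names `train`, `xs`, and `ys` are all assumptions made for the example.

```python
# Toy data generated from y = 3x, so the ideal weight is 3 (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def train(w=0.0, learning_rate=0.01, steps=200):
    """Fit y ~ w * x by repeating predict -> loss -> gradient -> update."""
    history = []
    n = len(xs)
    for _ in range(steps):
        preds = [w * x for x in xs]                                  # 1. predict
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n      # 2. loss (MSE)
        grad = sum(2.0 * (p - y) * x
                   for p, y, x in zip(preds, ys, xs)) / n            # 3. gradient d(loss)/dw
        w = w - learning_rate * grad                                 # 4. update
        history.append(loss)
    return w, history

w_final, losses = train()
print(f"w = {w_final:.4f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

The loss recorded at each step is the only feedback the loop uses; here it falls from 67.5 toward zero as the weight approaches 3.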

Loss surface explorer

Adjust the weight slider to move along the loss curve. Watch how the gradient (arrow) points toward lower loss.

Axes: w, loss

Readout: weight = -1.000, loss = 9.000, gradient = -6.000

Tip: The gradient points in the direction of steepest ascent. Gradient descent moves in the opposite direction (downhill) to minimize loss.
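The readout above (weight -1, loss 9, gradient -6) is consistent with the quadratic curve loss = (w - 2)². The lesson does not state the formula, so treat that curve as an assumption. Under it, a short sketch can compute the loss and gradient and check the gradient numerically:

```python
def loss(w):
    # Hypothetical curve inferred from the readout above; the lesson does not state it.
    return (w - 2.0) ** 2

def gradient(w):
    # Analytic derivative of (w - 2)^2 with respect to w.
    return 2.0 * (w - 2.0)

def numerical_gradient(f, w, eps=1e-6):
    # Central finite difference, a sanity check on the analytic gradient.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w = -1.0
print(loss(w), gradient(w))                                    # 9.0 -6.0
print(abs(numerical_gradient(loss, w) - gradient(w)) < 1e-4)   # True

# The gradient is negative, so stepping opposite to it increases w and lowers the loss.
w_next = w - 0.1 * gradient(w)
print(loss(w_next) < loss(w))                                  # True
```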

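A good loss function should be differentiable, at least almost everywhere. Gradient descent relies on the gradient (slope) to know which direction to move. Where a loss has a kink, as MAE does at zero error, the gradient is undefined at that single point and frameworks fall back on a subgradient. A loss that is flat or discontinuous over large regions gives the optimizer no usable signal, and optimization stalls.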

Smooth, convex losses are the easiest to optimize because every local minimum is also a global minimum; MSE applied to a linear model is one example. In a deep network the loss is non-convex in the weights, whichever loss function you pick, and has many local minima and saddle points. Training often still works in practice because many of those minima reach a loss close to the best achievable.

The loss function also shapes what the model learns. A loss that heavily penalizes large errors (like MSE) pulls the fit toward every point, outliers included; in the simplest case, the constant prediction that minimizes MSE is the mean. A robust loss (like MAE) lets the model largely ignore extreme points and fit the typical trend; its best constant prediction is the median.
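One concrete way to see this is to ask which single constant prediction each loss prefers on data with one extreme value. The sketch below uses made-up numbers and a brute-force grid search (both are assumptions for illustration). It shows MSE settling on the mean and MAE on the median.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up targets; 100.0 is the outlier

def mse(c):
    # Mean squared error of predicting the constant c for every value.
    return sum((v - c) ** 2 for v in values) / len(values)

def mae(c):
    # Mean absolute error of predicting the constant c for every value.
    return sum(abs(v - c) for v in values) / len(values)

# Try every candidate constant from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 1001)]
best_for_mse = min(candidates, key=mse)
best_for_mae = min(candidates, key=mae)

print(best_for_mse, statistics.mean(values))     # 22.0 22.0
print(best_for_mae, statistics.median(values))   # 3.0 3.0
```

The single outlier drags the MSE-optimal prediction from the cluster around 1 to 4 all the way to 22, while the MAE-optimal prediction stays at 3.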

Quick check

What is the primary purpose of a loss function in neural network training?

MSE vs MAE — regression losses

Regression predicts continuous values. The two most common loss functions are MSE (Mean Squared Error) and MAE (Mean Absolute Error). They differ in how they penalize large mistakes.

Loss comparison

Drag points to change predictions. Watch how MSE and MAE respond differently.

Axes: target, pred
Legend: true target, prediction (drag to move), outlier
MSE (penalizes big errors²): 0.038
MAE (linear penalty): 0.180
Huber (robust to outliers): 0.019

MSE: penalizes outliers

Squaring the error makes large mistakes very expensive: a 2-unit error costs 4x as much as a 1-unit error. Useful for clean data, where a large error is a genuine mistake the model should work hard to avoid. Sensitive to outliers.

MAE: robust to outliers

Linear penalty: a 2-unit error costs exactly twice as much as a 1-unit error. Less sensitive to extreme values. Good for noisy data where you want the model to fit the trend, not every outlier.
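The difference is easy to check by hand. The sketch below computes both losses on a small set of invented predictions, first with every prediction off by 0.5 and then with one prediction off by 10. The data and the function names are assumptions for illustration.

```python
def mse(targets, preds):
    # Mean of squared errors.
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

def mae(targets, preds):
    # Mean of absolute errors.
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0]
clean_preds = [1.5, 2.5, 3.5, 4.5]      # every prediction is off by 0.5
outlier_preds = [1.5, 2.5, 3.5, 14.0]   # the last prediction is off by 10

print(mse(targets, clean_preds), mae(targets, clean_preds))       # 0.25 0.5
print(mse(targets, outlier_preds), mae(targets, outlier_preds))   # 25.1875 2.875
```

One bad point multiplies MSE by about 100 but MAE by less than 6, which is what "sensitive to outliers" means in practice.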

Huber loss combines MSE and MAE: for small errors, it behaves like MSE (smooth); for large errors, it behaves like MAE (linear). You control the boundary with a hyperparameter δ (delta).

Formula: If |error| ≤ δ, loss = 0.5 × error². If |error| > δ, loss = δ × (|error| - 0.5δ).

Huber is a pragmatic choice: smooth enough for optimization, robust enough for noisy data. It is a reasonable default when you're not sure whether to choose MSE or MAE.
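The piecewise formula above translates directly into code. This is a minimal sketch for a single error value; the function name `huber` and the default δ = 1.0 are choices made for the example.

```python
def huber(error, delta=1.0):
    """Huber loss for one error value, following the piecewise formula above."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2                 # quadratic zone, behaves like MSE
    return delta * (abs_error - 0.5 * delta)    # linear zone, behaves like MAE

for e in [0.5, 1.0, 3.0, 10.0]:
    # Compare the plain half-squared error with the Huber value for the same error.
    print(e, 0.5 * e ** 2, huber(e))
```

With δ = 1, an error of 10 costs 9.5 under Huber against 50 under half-squared error, and the two branches agree at |error| = δ (both give 0.5), so the curve has no jump.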

Quick check

You're building a house price predictor. Your data has a few extreme luxury estates (million-dollar outliers). Which loss function would you choose?

Cross-entropy — classification losses

Classification predicts discrete categories. The model outputs probabilities, and cross-entropy measures how close those probabilities are to the ground truth. The key insight: being confidently wrong is heavily penalized.

Cross-entropy for classification

Adjust the probability sliders to see how -log(p) penalizes confident wrong answers. The true label is Class 0.

Class probabilities: 0.700 (Class 0), 0.200, 0.100

Loss function: -log(p) (axes: p, loss)

Probability of true class: 70.0%

Cross-entropy loss: 0.357 = -log(0.700)

Model confidence: confident and correct, low loss

Key insight:

Cross-entropy loss approaches infinity as you become more confident in the wrong class. This heavily penalizes overconfident mistakes — the model pays a steep price for being confidently wrong.

Binary cross-entropy

For 2-class problems (positive/negative, cat/not-cat). The model predicts a single probability p for the positive class; the loss is -log(p) if the true label is positive, or -log(1-p) if it is negative. One sigmoid output.
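A minimal sketch of binary cross-entropy for a single example follows. The clamping constant `eps` is an assumption added here to avoid taking log(0); it is not part of the definition above.

```python
import math

def sigmoid(z):
    # Squash a raw score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, label, eps=1e-12):
    """Loss for one example: p is the predicted probability of the positive class, label is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0); eps is an arbitrary choice
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

p = sigmoid(2.0)                      # a fairly confident "positive" prediction
print(round(p, 3))
print(round(binary_cross_entropy(p, 1), 3))   # low loss: the label agrees
print(round(binary_cross_entropy(p, 0), 3))   # high loss: the label disagrees
```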

Categorical cross-entropy

For 3+ classes (dog/cat/bird). The model outputs K probabilities (one per class) via softmax, and loss is -log(p_true) for the correct class. K outputs, one per class.
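The softmax-then-negative-log recipe can be sketched as follows. The logits are invented for the example, and subtracting the maximum logit before exponentiating is a common numerical-stability habit rather than something the lesson prescribes.

```python
import math

def softmax(logits):
    # Subtract the max logit first so exp() cannot overflow; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    # Only the probability assigned to the correct class enters the loss.
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])      # made-up raw scores for 3 classes
print([round(p, 3) for p in probs])
print(round(categorical_cross_entropy(probs, 0), 3))

# The slider values from the demo above: true class 0 with probability 0.7.
print(round(categorical_cross_entropy([0.7, 0.2, 0.1], 0), 3))   # 0.357
```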

The loss is -log(p_true), where p_true is the probability assigned to the correct class and log is the natural logarithm.

  • If p_true = 0.9, loss = -log(0.9) ≈ 0.1 (low — good!)
  • If p_true = 0.5, loss = -log(0.5) ≈ 0.7 (moderate)
  • If p_true = 0.1, loss = -log(0.1) ≈ 2.3 (high)
  • If p_true = 0.01, loss = -log(0.01) ≈ 4.6 (very high!)

As p approaches 0, -log(p) approaches infinity. This design choice forces the model to be honest: if you're wrong, the penalty grows without bound the more confidently wrong you were. This is much stricter than a squared-error penalty on the same probability, which can never exceed 1 however wrong the prediction is.
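To put numbers on the comparison, the sketch below evaluates -log(p_true) next to a squared-error penalty, (1 - p_true)², on the same probability. Measuring squared error this way is an assumption chosen for the illustration.

```python
import math

def penalties(p_true):
    """Return (cross-entropy, squared error) for the probability given to the true class."""
    cross_entropy = -math.log(p_true)
    squared_error = (1.0 - p_true) ** 2
    return cross_entropy, squared_error

for p in [0.9, 0.5, 0.1, 0.01, 0.0001]:
    ce, se = penalties(p)
    print(f"p_true={p:<7} cross-entropy={ce:6.3f}  squared error={se:.3f}")
```

The squared-error column levels off just below 1, while the cross-entropy column keeps climbing as p_true shrinks.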

Quick check

A model confidently predicts class A (95% probability) when the true answer is class B. What does cross-entropy loss look like?

The gradient — finding the direction to go

A loss function by itself doesn't teach anything. The gradient, the derivative ∂loss/∂weight, is what tells the optimizer which direction to adjust weights. If the gradient is positive, increasing the weight increases loss, so the weight should go down. If negative, increasing the weight decreases loss, so the weight should go up. Gradient descent moves opposite the gradient, downhill toward lower loss.

Formula: new_weight = old_weight − learning_rate × gradient

The learning rate controls step size. Too large: you overshoot the minimum and can diverge. Too small: you learn slowly.
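The update rule and the effect of the learning rate can be tried on a one-weight example. The sketch assumes a hypothetical loss (w - 2)² with its minimum at w = 2. The three learning rates are arbitrary picks meant to show the slow, healthy, and divergent regimes.

```python
def gradient(w):
    # Derivative of the hypothetical loss (w - 2)^2.
    return 2.0 * (w - 2.0)

def descend(w, learning_rate, steps=20):
    """Apply new_weight = old_weight - learning_rate * gradient for a fixed number of steps."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

for lr in [0.001, 0.1, 1.1]:
    print(f"learning_rate={lr}: w after 20 steps = {descend(-1.0, lr):.3f}")
```

Starting from w = -1, a rate of 0.1 lands close to the minimum at 2. A rate of 0.001 has barely moved after 20 steps, and a rate of 1.1 overshoots further on every step and ends up far from 2.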

📍

Gradient > 0

Loss rises as the weight increases. Move weight DOWN (opposite gradient).

📍

Gradient ≈ 0

You're near a minimum. Almost no adjustment needed.

📍

Gradient < 0

Loss falls as the weight increases. Move weight UP (opposite gradient).

In a deep network, the loss depends on the output, which depends on the hidden layer, which in turn depends on the weights and the input. To get ∂loss/∂weight, you apply the chain rule and multiply the partial derivatives along that path:

∂loss/∂weight = (∂loss/∂output) × (∂output/∂hidden) × ... × (∂hidden/∂weight)

This is backpropagation: starting from the loss, you compute gradients layer by layer, moving backward from output to input. Modern frameworks (PyTorch, TensorFlow) do this automatically using autodiff (automatic differentiation).
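To make the chain rule concrete, here is backpropagation written out by hand for a deliberately tiny network: one tanh hidden unit, one linear output, and a squared-error loss. The architecture, the input values, and all names are assumptions for the sketch. The finite-difference check at the end is a sanity test, not how frameworks compute gradients.

```python
import math

def forward(w1, w2, x):
    hidden = math.tanh(w1 * x)    # hidden layer
    output = w2 * hidden          # output layer
    return hidden, output

def loss_fn(w1, w2, x, y):
    _, output = forward(w1, w2, x)
    return (output - y) ** 2

def backward(w1, w2, x, y):
    """Chain rule, moving backward from the loss toward the first weight."""
    hidden, output = forward(w1, w2, x)
    dloss_doutput = 2.0 * (output - y)        # d(loss)/d(output)
    doutput_dhidden = w2                      # d(output)/d(hidden)
    dhidden_dw1 = (1.0 - hidden ** 2) * x     # d(hidden)/d(w1); tanh'(z) = 1 - tanh(z)^2
    grad_w1 = dloss_doutput * doutput_dhidden * dhidden_dw1
    grad_w2 = dloss_doutput * hidden          # d(output)/d(w2) = hidden
    return grad_w1, grad_w2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, y)

# Compare against central finite differences.
eps = 1e-6
num_w1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
num_w2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)
print(abs(grad_w1 - num_w1) < 1e-5, abs(grad_w2 - num_w2) < 1e-5)
```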

A key challenge: in very deep networks, gradients can vanish (become tiny) or explode (become huge) as they propagate backward. This is why techniques like batch normalization and residual connections exist.
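Both effects come from multiplying many per-layer derivatives together. The sketch below uses two simplified one-unit chains, both assumptions for illustration: a stack of sigmoid layers, where each factor is at most 0.25, and a stack of linear layers with weight 1.5, where each factor is 1.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_chain_gradient(depth, weight=1.0, x=0.5):
    """d(final activation)/d(input) for a chain a_k = sigmoid(weight * a_{k-1})."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)   # local derivative of this layer
    return grad

def linear_chain_gradient(depth, weight=1.5):
    """d(final activation)/d(input) for a chain a_k = weight * a_{k-1}."""
    return weight ** depth

for depth in [1, 5, 20]:
    print(depth, sigmoid_chain_gradient(depth), linear_chain_gradient(depth))
```

At depth 20 the sigmoid chain's gradient is below 1e-11 while the linear chain's is above 3000; real networks are more complicated, but the same multiplication drives both problems.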

Quick check

You compute the gradient of MSE loss with respect to a weight and get ∇loss = −2.5. With learning rate = 0.1, what's the weight update?

Guided step-by-step walkthrough

Work through six interactive steps to build intuition for loss functions. Adjust sliders, watch metrics change, and see the concepts come alive.

A loss function measures how wrong you are

Every prediction has an error: the gap between what you predicted and the truth. A loss function turns that gap into a single number. Higher loss = bigger mistake. Your goal during training: make loss as small as possible.

The gap is your loss

Target: 2

Prediction: 1.50

Gap: 0.50

Prediction vs Target (axis from 0 to 4; legend: target, prediction, error)

Loss = 0.50

Open playground

Generate random data (regression or classification), choose a loss function, and run gradient descent. Adjust the learning rate and iterations to see how they affect convergence. Add outliers and watch how MSE and MAE respond differently.

Choosing the right loss

Regression: predicting continuous values
  • MSE (default)
  • MAE (robust)
  • Huber (balanced)

Binary classification: 2 classes (yes/no, positive/negative)
  • Binary Cross-Entropy
  • Sigmoid output

Multi-class classification: 3+ classes (cat/dog/bird)
  • Categorical Cross-Entropy
  • Softmax output

Unsure? Safe default choices
  • Huber (regression)
  • Cross-Entropy (classification)

Key takeaway

A loss function quantifies error and provides the gradient signal for learning. MSE and MAE suit regression; cross-entropy suits classification. The right choice depends on your data, problem, and robustness requirements. MSE is smooth and standard, MAE is robust to outliers, and cross-entropy rewards accurate probabilities while punishing confident mistakes. Master these three and you have the core of how deep learning models are trained.
