Interactive · ~20 min · Intermediate

Information Theory — Measuring Uncertainty and Information

What is surprise? How much information does a rare event carry? Why does machine learning use cross-entropy loss? Information theory answers these questions. Learn how to measure entropy, compute KL divergence, understand optimal coding, and apply information-theoretic principles to deep learning.

Step-by-Step Guided Lesson

Start here to learn information theory from first principles: surprise, entropy, cross-entropy, KL divergence, mutual information, and real-world applications in machine learning.

Step 1: Surprise (Information Content)

How much information do rare events carry?

Information is about surprise. A common event tells you little; a rare event tells you a lot.

Surprise (bits) = log₂(1 / probability)

- Rolling a 1 on a fair die: log₂(1/0.167) ≈ 2.58 bits

- Flipping heads on a fair coin: log₂(1/0.5) = 1 bit

- A certain event (p=1): log₂(1/1) = 0 bits
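
The three examples above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `surprise_bits` is an illustrative choice, not part of the lesson.

```python
import math

def surprise_bits(p: float) -> float:
    """Information content (surprise) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

examples = [
    ("fair die shows 1", 1 / 6),
    ("fair coin shows heads", 0.5),
    ("certain event", 1.0),
]
for label, p in examples:
    # Rarer events (smaller p) give larger surprise values.
    print(f"{label}: {surprise_bits(p):.2f} bits")
```

The printed values should match the list above: about 2.58 bits, 1 bit, and 0 bits.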

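Example: surprise of seeing heads when heads has probability p

- Fair coin (p=0.5): log₂(1/0.5) = 1 bit, the most surprising of the three

- Biased coin (p=0.9): log₂(1/0.9) ≈ 0.152 bits, less surprising

- Extreme bias (p=0.99): log₂(1/0.99) ≈ 0.014 bits, very unsurprising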

Quick check

What does entropy measure?

Understanding Entropy

Entropy quantifies uncertainty. Adjust the probabilities below and watch how entropy changes. A uniform distribution (all outcomes equally likely) maximizes entropy. An extreme distribution (one certain outcome) minimizes entropy to zero.

Interactive Entropy Explorer

Adjust probabilities and watch entropy change in real time.

Formula

H(X) = −∑ p(x) log₂(p(x))

Probabilities: 0.25, 0.25, 0.25, 0.25

Entropy (bits): 2 (max: 2)

Uniformity: 100% (nearly uniform)

Entropy range: Min (certain) to Max (uniform)

What this means

This distribution is uniform, so describing an outcome takes the full 2 bits, the maximum for four outcomes.

We use log base 2 because it measures information in bits. One bit is the amount of information needed to distinguish between two equally likely outcomes. If you have a fair coin (p=0.5 for each side), you need exactly 1 bit to encode the outcome: 0 for heads, 1 for tails.

With a fair 8-sided die, you need log₂(8) = 3 bits because there are 8 outcomes. With a skewed die where one face is much more likely, you need fewer bits on average.

Machine learning typically uses the natural logarithm (ln), which measures information in nats instead of bits. The math is identical; only the unit differs. 1 nat ≈ 1.44 bits.
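
The entropy formula and the choice of logarithm base can be tried out with a short sketch. The function name `entropy` and the two example distributions are illustrative assumptions.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution.

    Outcomes with p = 0 contribute nothing, following the usual convention.
    base=2 gives bits, base=math.e gives nats.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

uniform4 = [0.25, 0.25, 0.25, 0.25]   # four equally likely outcomes
skewed4 = [0.7, 0.1, 0.1, 0.1]        # one outcome dominates

print("uniform, bits:", entropy(uniform4))
print("skewed,  bits:", entropy(skewed4))
print("uniform, nats:", entropy(uniform4, math.e))
# Ratio of the same entropy in bits and in nats: 1 / ln(2), about 1.44.
print("bits per nat: ", entropy(uniform4) / entropy(uniform4, math.e))
```

The uniform distribution should come out at 2 bits, and the skewed one noticeably lower.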

Quick check

Which distribution has higher entropy: [0.9, 0.1] or [0.5, 0.5]?

Cross-Entropy: The Cost of Being Wrong

Cross-entropy measures the cost of using one distribution to encode samples from another. In machine learning, it's the classification loss: we want our model distribution (q) to match the true distribution (p). If our model is perfect, cross-entropy equals entropy. If wrong, we waste extra bits.

Formula

H(P, Q) = −∑ p(x) log₂(q(x))

P is the true distribution. Q is our model's predicted distribution. Lower cross-entropy means a better model.
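
As a rough illustration of the formula, the sketch below computes cross-entropy for a true distribution and two candidate models. The function name and the example distributions are made up for this illustration.

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum p(x) log2 q(x), in bits.

    Returns infinity if Q gives zero probability to an outcome that P allows.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # outcomes that never occur cost nothing
        if qx == 0:
            return math.inf   # the model ruled out something that happens
        total -= px * math.log2(qx)
    return total

p = [0.5, 0.25, 0.25]         # true distribution
good_q = [0.5, 0.25, 0.25]    # model that matches P
poor_q = [0.25, 0.25, 0.5]    # model that puts mass in the wrong place

print(cross_entropy_bits(p, good_q))  # 1.5, equal to the entropy of P
print(cross_entropy_bits(p, poor_q))  # 1.75, a quarter bit wasted per sample
```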

Quick check

If a classifier assigns p_model(class=cat)=0.01 but the true label is cat, what does high cross-entropy tell us?

Cross-entropy and KL divergence are intimately related:

H(P, Q) = H(P) + D_KL(P || Q)

This is profound: the cost of using Q to encode P equals the entropy of P plus the divergence between them. If Q = P (perfect model), D_KL = 0 and H(P, Q) = H(P). If they differ, the divergence term adds extra cost.

In machine learning, we minimize cross-entropy loss, which automatically minimizes KL divergence when H(P) is fixed. This is why cross-entropy is the fundamental loss function for classification.
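
The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be checked numerically. The sketch below uses a made-up true distribution and three hypothetical candidate models, and assumes every q(x) is positive wherever p(x) is positive.

```python
import math

def entropy_bits(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    # Assumes q(x) > 0 wherever p(x) > 0.
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_bits(p, q):
    # Assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # fixed "true" distribution (illustrative)
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close": [0.6, 0.3, 0.1],
    "exact": [0.7, 0.2, 0.1],
}

for name, q in candidates.items():
    ce, kl = cross_entropy_bits(p, q), kl_bits(p, q)
    # Cross-entropy splits into entropy plus divergence.
    assert math.isclose(ce, entropy_bits(p) + kl, abs_tol=1e-12)
    print(f"{name}: H(P,Q)={ce:.4f}  H(P)={entropy_bits(p):.4f}  KL={kl:.4f}")

# Because H(P) is the same for every candidate, both criteria pick the same model.
best_by_ce = min(candidates, key=lambda n: cross_entropy_bits(p, candidates[n]))
best_by_kl = min(candidates, key=lambda n: kl_bits(p, candidates[n]))
print(best_by_ce, best_by_kl)
```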

KL Divergence: Measuring Distribution Mismatch

KL divergence measures how much one distribution diverges from another. Crucially, it's asymmetric: D_KL(P||Q) ≠ D_KL(Q||P). Try adjusting the distributions below to see the asymmetry in action.

KL Divergence: Asymmetry

Adjust two distributions and see how divergence is asymmetric.

Formula

D_KL(P || Q) = ∑ p(x) log₂(p(x) / q(x))

True Distribution (P)

p(1) = 0.333, p(2) = 0.333, p(3) = 0.333

Entropy: 1.585 bits

Predicted Distribution (Q)

q(1) = 0.333, q(2) = 0.333, q(3) = 0.333

Entropy: 1.585 bits

D_KL(P || Q) = 0 (how much Q diverges from P)

D_KL(Q || P) = 0 (how much P diverges from Q)

Key insight

KL divergence is not symmetric: in general D_KL(P || Q) ≠ D_KL(Q || P). Both directions are 0 here only because the two distributions are identical.

D_KL(P||Q) measures "how much Q diverges from P" by summing p(x) log(p(x)/q(x)). When p(x) is large but q(x) is small, the term p(x) log(p(x)/q(x)) explodes. This penalizes Q for missing probability mass where P is confident.

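In contrast, D_KL(Q||P) sums q(x) log(q(x)/p(x)), so each term is weighted by q(x). Wherever q(x) is zero the term contributes nothing, so Q pays no penalty for ignoring regions where P has mass. But if Q assigns mass where p(x) is zero, the term is infinite. When Q is fit to P, minimizing D_KL(Q||P) therefore tends to be mode-seeking (Q concentrates on one high-probability region of P), while minimizing D_KL(P||Q) tends to be mass-covering (Q spreads out to cover everything P can produce).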

This asymmetry matters in machine learning. Variational Autoencoders use D_KL(q_encoder || p_prior) to regularize the encoder. If they used D_KL(p_prior || q_encoder), the behavior would be different.
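
A small numerical sketch makes the asymmetry concrete. The distributions below are made-up examples, and `kl_bits` is an illustrative helper name.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits; infinite if q(x) = 0 somewhere p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # terms with p(x) = 0 contribute nothing
        if qx == 0:
            return math.inf   # Q rules out an outcome that P produces
        total += px * math.log2(px / qx)
    return total

# Case 1: same support, different shapes.
p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_bits(p, q)   # roughly 0.74 bits
reverse = kl_bits(q, p)   # roughly 0.53 bits
print(forward, reverse)

# Case 2: P never produces the third outcome, Q sometimes does.
p_sparse = [0.5, 0.5, 0.0]
q_full = [1 / 3, 1 / 3, 1 / 3]
print(kl_bits(p_sparse, q_full))  # finite: Q covers everything P can produce
print(kl_bits(q_full, p_sparse))  # inf: P gives zero probability to an outcome of Q
```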

Quick check

In the formula D_KL(P || Q) = ∑ p(x) log(p(x) / q(x)), what happens if Q assigns zero probability to an outcome where P has nonzero probability?

Huffman Coding: Optimal Compression

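Entropy tells us the minimum average number of bits per symbol needed to encode samples from a distribution. Huffman codes come within one bit per symbol of this limit and are optimal among codes that give each symbol its own codeword. Assign frequencies to symbols and watch the algorithm build an optimal binary tree.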

Huffman Coding Builder

Enter symbol frequencies to build optimal binary codes.

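Symbols & Frequencies: 25%, 25%, 25%, 25%

Generated Huffman Codes

| Symbol | Code | Freq | Length | Bits/Symbol |
|--------|------|------|--------|-------------|
| A      | 00   | 1    | 2      | 2           |
| B      | 01   | 1    | 2      | 2           |
| C      | 10   | 1    | 2      | 2           |
| D      | 11   | 1    | 2      | 2           |

Entropy: 2 bits/symbol

Avg Code Length: 2 bits/symbol

Efficiency: 100% of optimal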

Key insight

Huffman codes are optimal among symbol-by-symbol prefix codes: their average length is at least the entropy and less than entropy + 1 bit per symbol. Higher entropy means more uncertainty, so you need longer codes on average. Skewed distributions (where some symbols are much more frequent) have lower entropy and need fewer bits on average.

Huffman's algorithm builds a binary tree by repeatedly merging the two lowest-frequency nodes. The result is an optimal prefix-free code: no code is a prefix of another, and the average code length is as short as possible.

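Shannon's source coding theorem shows that no uniquely decodable code can average fewer than H(X) bits per symbol, and that an optimal prefix code has average length L with H(X) ≤ L < H(X) + 1. Huffman coding produces such an optimal code, and L equals H(X) exactly when every probability is a negative power of 2 (1/2, 1/4, 1/8, ...).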
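
The merge procedure and the entropy bound can be sketched with Python's `heapq`. This is one straightforward way to implement the algorithm, not necessarily how the builder above works. The symbol names and frequencies are made up, and ties between equal weights are broken by insertion order, which is an arbitrary choice.

```python
import heapq
import math
from itertools import count

def huffman_codes(freqs):
    """Return {symbol: bitstring} for a dict of positive frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(w, next(tiebreak), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lowest-weight nodes; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

def average_length(freqs, codes):
    return sum(freqs[s] * len(codes[s]) for s in freqs)

def entropy_bits(freqs):
    return -sum(w * math.log2(w) for w in freqs.values() if w > 0)

dyadic = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
dyadic_codes = huffman_codes(dyadic)
print(dyadic_codes)  # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(average_length(dyadic, dyadic_codes), entropy_bits(dyadic))  # 1.75 1.75

skewed = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
skewed_codes = huffman_codes(skewed)
# Not dyadic, so the average length sits strictly between H and H + 1.
print(average_length(skewed, skewed_codes), entropy_bits(skewed))
```

With probabilities that are all negative powers of 2, the average code length equals the entropy exactly; with the second distribution it lands slightly above.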

Entropy gives the theoretical limit for lossless compression: if you want to transmit data as efficiently as possible, Shannon entropy is your guide. Practical formats such as gzip and PNG rely on DEFLATE, which combines LZ77 matching with Huffman coding.

Quick check

If a symbol has probability 0.5, what is the optimal code length for that symbol?

Information Theory in Deep Learning

Information theory is woven throughout machine learning. Here are the key applications:

Classification Loss

Minimizing cross-entropy loss in a neural network minimizes D_KL(p_true || p_model), because H(p_true) does not depend on the model. In practice we draw samples from the true distribution (the labeled data) and average −log p_model(label) over them, which estimates the cross-entropy. As the KL divergence approaches zero, the model's predicted distribution approaches the true one.
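
For a single labeled example, the loss is the negative log of the probability the model gives the correct class. The sketch below uses plain Python rather than any particular framework. The logits, labels, and function names are illustrative, and the loss is in nats because it uses the natural logarithm.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_nats(logits, label):
    """Cross-entropy against a one-hot label: -ln p_model(label)."""
    return -math.log(softmax(logits)[label])

# (logits, true class index) pairs, made up for illustration.
batch = [
    ([4.0, 0.0, 0.0], 0),   # confident and correct: small loss
    ([0.0, 4.0, 0.0], 2),   # confident and wrong: large loss
    ([0.0, 0.0, 0.0], 1),   # uniform guess: loss is ln(3)
]
losses = [nll_nats(logits, label) for logits, label in batch]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

Averaging these per-example losses over a dataset gives the sample estimate of the cross-entropy between the data and the model.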

Variational Autoencoders (VAEs)

VAEs minimize: Reconstruction Loss + β × D_KL(q_encoder || p_prior). The KL term regularizes the encoder to stay close to a standard prior (usually Gaussian). This enforces a bottleneck in the latent space.
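
When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term has a closed form. The sketch below assumes exactly that setup, with the encoder producing a mean and a log-variance per latent dimension. The function names, the `beta` default, and the example numbers are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln sigma^2).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_objective(reconstruction_loss, mu, log_var, beta=1.0):
    """Reconstruction loss plus the beta-weighted KL regularizer."""
    return reconstruction_loss + beta * gaussian_kl_to_standard_normal(mu, log_var)

# Encoder output equal to the prior: no KL penalty.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0
# Shifting one latent mean by 1 costs half a nat.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))   # 0.5
# A larger beta pushes harder toward the prior.
print(vae_objective(2.0, [1.0, 0.0], [0.0, 0.0], beta=4.0))     # 4.0
```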

Information Bottleneck

The Information Bottleneck principle states: maximize I(Z; Y) (the information the representation Z keeps about the task label Y) while minimizing I(Z; X) (the information Z retains about the input X beyond what the task needs). It has been proposed as one explanation of why deep learning works: hidden layers learn compressed representations.

Mutual Information & Feature Selection

Mutual information I(X; Y) measures how much knowing feature X tells you about label Y. High mutual information means the feature is informative. Use it to rank features by importance before training.
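
For discrete variables, mutual information can be computed directly from a joint probability table using I(X; Y) = ∑ p(x, y) log₂(p(x, y) / (p(x) p(y))). The sketch below uses made-up 2×2 tables for a binary feature and a binary label.

```python
import math

def mutual_information_bits(joint):
    """I(X; Y) in bits from a joint table where joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # zero cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]    # feature says nothing about label
deterministic = [[0.5, 0.0], [0.0, 0.5]]      # feature determines the label
partial = [[0.4, 0.1], [0.1, 0.4]]            # feature is informative but noisy

print(mutual_information_bits(independent))    # 0.0
print(mutual_information_bits(deterministic))  # 1.0
print(mutual_information_bits(partial))        # between 0 and 1
```

Ranking features then amounts to computing this quantity for each feature against the label and sorting. With continuous features the joint table would first have to be estimated, for example by binning.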

Entropy & Calibration

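A calibrated model's confidence matches how often it is right: predictions made with 80% confidence should be correct about 80% of the time. The entropy of the predicted distribution summarizes that confidence, so a well-calibrated model outputs high entropy on examples it is genuinely uncertain about and low entropy on examples it knows well. This calibration is crucial for decision-making under uncertainty.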

Contrastive Learning & Mutual Information

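Contrastive methods (SimCLR, MoCo) train with InfoNCE-style losses, which can be interpreted as maximizing a lower bound on the mutual information between augmented views of the same image while pushing apart views of different images. This drives the model to learn representations that are invariant to the augmentations.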
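
A simplified, one-directional InfoNCE loss is sketched below in plain Python. It is not the exact loss used by SimCLR or MoCo, which differ in how negatives are collected. The embeddings, the temperature value, and the function name are illustrative assumptions. For a batch of N pairs, log N minus this loss is a known lower bound on the mutual information between the two views, in nats.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss in nats.

    Row i of `positives` is the matching view for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    b = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_sum_exp - logits[i]   # -log softmax probability of the true pair
    return loss / len(a)

view_a = [[1.0, 0.0], [0.0, 1.0]]         # embeddings of two images (made up)
view_b = [[0.9, 0.1], [0.1, 0.9]]         # second augmented view of the same images
loss_aligned = info_nce(view_a, view_b)
loss_mismatched = info_nce(view_a, view_b[::-1])   # pairs deliberately swapped
print(loss_aligned, loss_mismatched)
```

The loss is close to zero when matching views have similar embeddings and large when the pairing is wrong.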

Quick check

Which quantity does minimizing cross-entropy loss also minimize when training a classifier?

Information theory provides a unified framework for understanding what neural networks learn. The Information Bottleneck principle suggests that training involves two phases:

1. Fitting phase: Maximize mutual information between data and hidden layers.

2. Compression phase: Compress the representation while preserving task-relevant information.

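This is one proposed explanation of why deep networks generalize: they learn to discard noisy or irrelevant features, keeping only the information needed for the task. Whether a distinct compression phase occurs in all networks remains debated.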

Information theory also offers one interpretation of why tricks like dropout and batch normalization help: the noise they introduce encourages the network to find robust, compressed representations that remain informative despite perturbations.

Interactive Playground

Build your own distributions and compute all information-theoretic quantities. Switch between modes to explore single distributions, compare two distributions, or see all metrics at once.

Information Theory Playground

Create custom distributions and compute all quantities.

Distribution P: 25%, 25%, 25%, 25%

Entropy H(P): 2 bits

Uniformity: 100%

Quick check

In the playground, if you set two distributions to be identical, what should D_KL(P || Q) equal?

Key Takeaways

Entropy

H(X) measures the expected bits needed to describe outcomes. Maximum for uniform distributions, minimum (zero) for certainty. It's the fundamental limit on compression.

Cross-Entropy

H(P, Q) is the cost of using Q to encode P. In ML, it's the classification loss. Lower cross-entropy means better predictions. Zero extra cost only when model matches reality perfectly.

KL Divergence

D_KL(P || Q) = H(P, Q) − H(P) measures divergence. Asymmetric and always ≥ 0. Penalizes Q for missing probability mass where P is confident. The extra cost beyond optimal compression.

Optimal Coding

Huffman codes come within one bit of the Shannon limit: their average length lies between H(X) and H(X) + 1 bits per symbol. Entropy is the theoretical limit for lossless compression.

Mutual Information

I(X; Y) measures how much knowing X tells you about Y. It is zero if X and Y are independent and largest when one determines the other. It is symmetric and is used for feature selection and for measuring statistical dependence.

Deep Learning Connection

Cross-entropy loss minimizes KL divergence. VAEs use KL regularization. The Information Bottleneck offers one account of why deep networks learn compressed representations. Entropy guides calibration and uncertainty.

Final Checks

Quick check

What is the relationship between entropy and KL divergence?

Quick check

In VAEs, why is D_KL(q_encoder || p_prior) used instead of D_KL(p_prior || q_encoder)?

Quick check

What does it mean if a trained classifier outputs entropy close to zero on a test sample?
