Information Theory — Measuring Uncertainty and Information
What is surprise? How much information does a rare event carry? Why does machine learning use cross-entropy loss? Information theory answers these questions. Learn how to measure entropy, compute KL divergence, understand optimal coding, and apply information-theoretic principles to deep learning.
Step-by-Step Guided Lesson
Start here to learn information theory from first principles: surprise, entropy, cross-entropy, KL divergence, mutual information, and real-world applications in machine learning.
Step 1: Surprise (Information Content)
How much information does a rare event carry?
Information is about surprise. A common event tells you little; a rare event tells you a lot.
Surprise (bits) = log₂(1 / probability)
- Rolling a 1 on a fair die: log₂(1/0.167) ≈ 2.58 bits
- Flipping heads on a fair coin: log₂(1/0.5) = 1 bit
- A certain event (p=1): log₂(1/1) = 0 bits
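Example: surprise of seeing heads for different coins
| Coin | P(heads) | Note | Surprise of heads (bits) |
|---|---|---|---|
| Fair coin | 0.5 | Most surprising of the three | 1 |
| Biased coin | 0.9 | Less surprising | 0.152 |
| Extreme bias | 0.99 | Very unsurprising | 0.014 |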
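The surprise values above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `surprise_bits` is made up for this illustration and is not from any library.

```python
import math

def surprise_bits(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

events = [
    ("fair die shows 1", 1 / 6),
    ("fair coin shows heads", 0.5),
    ("biased coin (p=0.9) shows heads", 0.9),
    ("certain event", 1.0),
]
for label, p in events:
    # Prints 2.585, 1.000, 0.152 and 0.000 bits respectively.
    print(f"{label}: {surprise_bits(p):.3f} bits")
```

Because the logarithm turns products into sums, the surprise of two independent events is the sum of their surprises: two fair coin flips both landing heads (p = 0.25) carry 2 bits.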
Quick check
What does entropy measure?
Understanding Entropy
Entropy quantifies uncertainty. Adjust the probabilities below and watch how entropy changes. A uniform distribution (all outcomes equally likely) maximizes entropy. An extreme distribution (one certain outcome) minimizes entropy to zero.
Interactive Entropy Explorer
Adjust probabilities and watch entropy change in real time.
Formula
H(X) = −∑ p(x) log₂(p(x))
Entropy: 2 bits (max: 2 bits)
Uniformity: 100%
What this means: this distribution is uniform, so you need the maximum number of bits to describe outcomes.
We use log base 2 because it measures information in bits. One bit is the amount of information needed to distinguish between two equally likely outcomes. If you have a fair coin (p=0.5 for each side), you need exactly 1 bit to encode the outcome: 0 for heads, 1 for tails.
With a fair 8-sided die, you need log₂(8) = 3 bits because there are 8 outcomes. With a skewed die where one face is much more likely, you need fewer bits on average.
Machine learning typically uses the natural logarithm (ln), which measures information in nats instead of bits. The math is identical up to a constant factor: 1 nat = 1/ln 2 ≈ 1.44 bits.
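As a rough illustration of the entropy formula and the bits-to-nats conversion, here is a small sketch using only the standard library. The helper names `entropy_bits` and `bits_to_nats` are assumptions made for this example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def bits_to_nats(bits):
    """Convert bits to nats: 1 bit = ln(2) nats."""
    return bits * math.log(2)

print(entropy_bits([0.5, 0.5]))                 # fair coin: 1.0
print(entropy_bits([1 / 8] * 8))                # fair 8-sided die: 3.0
print(entropy_bits([0.9, 0.1]))                 # biased coin: about 0.469
print(bits_to_nats(entropy_bits([0.5, 0.5])))   # fair coin in nats: about 0.693
```

The biased coin needs fewer than half a bit per flip on average, which matches the intuition that skewed distributions are cheaper to describe.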
Quick check
Which distribution has higher entropy: [0.9, 0.1] or [0.5, 0.5]?
Cross-Entropy: The Cost of Being Wrong
Cross-entropy measures the cost of using one distribution to encode samples from another. In machine learning, it's the classification loss: we want our model distribution (q) to match the true distribution (p). If our model is perfect, cross-entropy equals entropy. If wrong, we waste extra bits.
Formula
H(P, Q) = −∑ p(x) log₂(q(x))
P is the true distribution. Q is our model's predicted distribution. Lower cross-entropy means a better model.
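To make the formula concrete, the sketch below computes cross-entropy for a one-hot true label and two hypothetical model outputs. The function name `cross_entropy_bits` and the probability values are illustrative assumptions.

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) in bits: expected cost of coding samples from p with a code built for q."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # outcomes that never occur cost nothing
        if qx == 0:
            return math.inf     # q rules out something that actually happens
        total += px * math.log2(1.0 / qx)
    return total

true_label = [1.0, 0.0]         # one-hot: the image is a cat
confident_right = [0.9, 0.1]    # made-up model output
confident_wrong = [0.01, 0.99]  # made-up model output

print(cross_entropy_bits(true_label, confident_right))  # about 0.152
print(cross_entropy_bits(true_label, confident_wrong))  # about 6.644
```

With a one-hot label, cross-entropy reduces to the surprise the model assigns to the correct class, so a confidently wrong prediction is penalized heavily.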
Quick check
If a classifier assigns p_model(class=cat)=0.01 but the true label is cat, what does high cross-entropy tell us?
Cross-entropy and KL divergence are intimately related:
H(P, Q) = H(P) + D_KL(P || Q)
This is profound: the cost of using Q to encode P equals the entropy of P plus the divergence between them. If Q = P (perfect model), D_KL = 0 and H(P, Q) = H(P). If they differ, the divergence term adds extra cost.
In machine learning, we minimize cross-entropy loss, which automatically minimizes KL divergence when H(P) is fixed. This is why cross-entropy is the fundamental loss function for classification.
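The identity H(P, Q) = H(P) + D_KL(P || Q) can be checked numerically. The snippet below is a sketch with made-up distributions; it assumes every probability is strictly positive so that no logarithm of zero appears.

```python
import math

P = [0.5, 0.25, 0.25]   # "true" distribution (made-up values)
Q = [0.25, 0.25, 0.5]   # model distribution (made-up values)

h_p = -sum(p * math.log2(p) for p in P)                  # entropy H(P)
h_pq = -sum(p * math.log2(q) for p, q in zip(P, Q))      # cross-entropy H(P, Q)
d_kl = sum(p * math.log2(p / q) for p, q in zip(P, Q))   # D_KL(P || Q)

print(h_p, h_pq, d_kl)  # 1.5 1.75 0.25
assert math.isclose(h_pq, h_p + d_kl)
```

Here the model wastes 0.25 bits per symbol on top of the 1.5 bits that any code must spend.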
KL Divergence: Measuring Distribution Mismatch
KL divergence measures how much one distribution diverges from another. Crucially, it's asymmetric: in general, D_KL(P||Q) ≠ D_KL(Q||P). Try adjusting the distributions below to see the asymmetry in action.
KL Divergence: Asymmetry
Adjust two distributions and see how divergence is asymmetric.
Formula
D_KL(P || Q) = ∑ p(x) log₂(p(x) / q(x))
True Distribution (P): entropy 1.585 bits
Predicted Distribution (Q): entropy 1.585 bits
D_KL(P || Q) = 0 (how much Q diverges from P)
D_KL(Q || P) = 0 (how much P diverges from Q)
Key insight
KL divergence is not symmetric: in general, D_KL(P || Q) ≠ D_KL(Q || P). Both are zero when the distributions are identical.
D_KL(P||Q) measures "how much Q diverges from P" by summing p(x) log(p(x)/q(x)). When p(x) is large but q(x) is small, the term p(x) log(p(x)/q(x)) explodes. This penalizes Q for missing probability mass where P is confident.
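In contrast, D_KL(Q||P) sums q(x) log(q(x)/p(x)), so each term is weighted by q(x). Where q(x) = 0 the term contributes nothing, however large p(x) is, so Q is free to ignore regions where P has mass. Where Q assigns mass but p(x) is zero, the term is infinite, so Q is heavily penalized for placing mass where P has little or none. When fitting Q to P, minimizing D_KL(P||Q) therefore gives mass-covering behavior, while minimizing D_KL(Q||P) gives mode-seeking behavior.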
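This asymmetry matters in machine learning. Variational Autoencoders use D_KL(q_encoder || p_prior) to regularize the encoder, so the penalty is weighted by the encoder's own distribution. If they used D_KL(p_prior || q_encoder), the penalty would be weighted by the prior instead and the regularizer would behave differently.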
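A small numeric sketch makes the asymmetry visible, including the case where one distribution assigns zero probability to an outcome. The function name `kl_bits` and the example distributions are assumptions chosen for illustration.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits, using the convention 0 * log(0 / q) = 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # no weight on this outcome, so no contribution
        if qx == 0:
            return math.inf     # p puts mass where q has none
        total += px * math.log2(px / qx)
    return total

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_bits(P, Q))  # about 0.737
print(kl_bits(Q, P))  # about 0.531

R = [1.0, 0.0]        # puts all mass on the first outcome
print(kl_bits(P, R))  # inf: R assigns zero probability to something P produces
print(kl_bits(R, P))  # 1.0: finite, because R never visits the second outcome
```

The last two lines show both directions at their most extreme: the divergence is infinite when the second argument misses mass the first one has, and finite the other way around.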
Quick check
In the formula D_KL(P || Q) = ∑ p(x) log(p(x) / q(x)), what happens if Q assigns zero probability to an outcome where P has nonzero probability?
Huffman Coding: Optimal Compression
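Entropy is the lower bound on the average number of bits per symbol needed to encode samples from a distribution. Huffman codes come within one bit of that bound and are optimal among codes that give each symbol its own codeword. Assign frequencies to symbols and watch the algorithm build an optimal binary tree.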
Huffman Coding Builder
Enter symbol frequencies to build optimal binary codes.
Symbols & Frequencies
Generated Huffman Codes
| Symbol | Code | Freq | Length | Bits/Symbol |
|---|---|---|---|---|
| A | 00 | 1 | 2 | 2 |
| B | 01 | 1 | 2 | 2 |
| C | 10 | 1 | 2 | 2 |
| D | 11 | 1 | 2 | 2 |
Entropy: 2 bits/symbol
Avg Code Length: 2 bits/symbol
Efficiency: 100% of optimal
Key insight
Huffman codes are optimal prefix-free codes: on average they need fewer than entropy + 1 bits per symbol. Higher entropy means more uncertainty, so you need longer codes on average. Skewed distributions (where some symbols are much more frequent) need fewer average bits.
Huffman's algorithm builds a binary tree by repeatedly merging the two lowest-frequency nodes. The result is an optimal prefix-free code: no code is a prefix of another, and the average code length is as short as possible.
Shannon's source coding theorem implies that the average length L of an optimal prefix-free code satisfies H(X) ≤ L < H(X) + 1 bits per symbol. Huffman coding achieves this bound, and reaches exactly H(X) when every probability is a negative power of 2 (1/2, 1/4, 1/8, ...).
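The merge-the-two-smallest procedure can be sketched with Python's `heapq`. This is one possible implementation, not the code behind the builder above; the function names and the example frequencies are assumptions for illustration.

```python
import heapq
import math
from itertools import count

def huffman_codes(freqs):
    """Return {symbol: bitstring} for a dict of positive symbol frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)   # lowest-frequency node
        f2, _, codes2 = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def average_length(freqs, codes):
    """Frequency-weighted mean code length in bits per symbol."""
    total = sum(freqs.values())
    return sum(f * len(codes[s]) for s, f in freqs.items()) / total

def entropy_bits(freqs):
    """Entropy of the distribution obtained by normalizing the frequencies."""
    total = sum(freqs.values())
    return sum((f / total) * math.log2(total / f) for f in freqs.values())

dyadic = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
codes = huffman_codes(dyadic)
print(codes)                                                # code lengths 1, 2, 3, 3
print(average_length(dyadic, codes), entropy_bits(dyadic))  # 1.75 1.75

skewed = {"A": 0.9, "B": 0.1}
codes = huffman_codes(skewed)
print(average_length(skewed, codes), entropy_bits(skewed))  # 1.0 vs about 0.469
```

The dyadic example meets the entropy exactly, while the skewed two-symbol example shows the gap that remains when a symbol code cannot spend less than one bit per symbol.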
Entropy is the theoretical limit for lossless compression: if you want to store or transmit data as efficiently as possible, Shannon entropy is your guide. Practical formats build on these ideas; gzip and PNG both use DEFLATE, which combines LZ77 matching with Huffman coding.
Quick check
If a symbol has probability 0.5, what is the optimal code length for that symbol?
Information Theory in Deep Learning
Information theory is woven throughout machine learning. Here are the key applications:
Classification Loss
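Minimizing cross-entropy loss in neural networks also minimizes D_KL(p_true || p_model), because the two differ only by the entropy of the true distribution, which does not depend on the model. We sample from the true distribution (labeled data) and compute −log(p_model) for the correct class of each sample; the average over samples estimates the cross-entropy. As KL divergence approaches zero, the model's predictions match reality.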
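The per-sample computation can be sketched as follows. The code assumes the model produces raw scores (logits) that are turned into probabilities with a softmax, and uses natural logarithms, so the loss is in nats; the function names and the example batch are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(batch_logits, labels):
    """Mean of -ln p_model(true class) over the batch, in nats."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Made-up batch: three samples, two classes.
batch_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
labels = [0, 1, 1]
print(cross_entropy_loss(batch_logits, labels))
```

A sample with equal logits contributes ln 2 ≈ 0.693 nats, the loss of a coin-flip guess between two classes.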
Variational Autoencoders (VAEs)
VAEs minimize: Reconstruction Loss + β × D_KL(q_encoder || p_prior), where β weights the KL term (β = 1 gives the standard VAE objective). The KL term regularizes the encoder to stay close to a fixed prior (usually a standard Gaussian). This enforces a bottleneck in the latent space.
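When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term has a well-known closed form. The sketch below assumes that setup (the text above only says the prior is usually Gaussian) and assumes the encoder predicts a mean and a log-variance per latent dimension; the function name is made up.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln sigma^2).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0: encoder equals prior
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5: mean shifted by 1
```

The term is zero only when the encoder's output matches the prior exactly, and it grows as the mean moves away from zero or the variance moves away from one.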
Information Bottleneck
The Information Bottleneck principle states: maximize I(Z; Y) (information the representation Z keeps about the task label Y) while minimizing I(Z; X) (information Z retains about the input X). It offers one proposed explanation for why deep learning works: hidden layers learn compressed representations.
Mutual Information & Feature Selection
Mutual information I(X; Y) measures how much knowing feature X tells you about label Y. High mutual information means the feature is informative. Use it to rank features by importance before training.
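For discrete features, mutual information can be computed directly from a joint probability table. The sketch below assumes the joint distribution is already known (in practice it would be estimated from counts); the function name and the example tables are illustrative.

```python
import math

def mutual_information_bits(joint):
    """I(X; Y) in bits from a joint table joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]         # marginal of X
    py = [sum(col) for col in zip(*joint)]   # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]   # feature says nothing about label
deterministic = [[0.5, 0.0], [0.0, 0.5]]     # feature determines label
noisy = [[0.4, 0.1], [0.1, 0.4]]             # feature is right 80% of the time

print(mutual_information_bits(independent))    # 0.0
print(mutual_information_bits(deterministic))  # 1.0
print(mutual_information_bits(noisy))          # about 0.278
```

Ranking features by this score puts the deterministic feature first, the noisy one second, and the independent one last.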
Entropy & Calibration
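A well-calibrated model's confidence matches how often it is right: predictions made with probability 0.8 should be correct about 80% of the time. Such a model outputs high-entropy (uncertain) predictions on examples it is unsure about and low-entropy (confident) predictions on examples it knows well. Calibration is crucial for decision-making under uncertainty.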
Contrastive Learning & Mutual Information
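Contrastive methods (SimCLR, MoCo) pull together the representations of augmented views of the same image and push apart the representations of different images. Their losses can be interpreted as maximizing a lower bound on the mutual information between views. This drives the model to learn representations that are invariant to the augmentations.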
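A common way to write this kind of objective is an InfoNCE-style loss: a softmax cross-entropy that asks the model to pick the positive view out of N candidates. The sketch below is a simplified illustration under stated assumptions, not the exact loss of SimCLR or MoCo; it assumes similarity scores are already computed, and the function name and temperature value are hypothetical.

```python
import math

def info_nce(similarities, positive_index, temperature=0.1):
    """-ln softmax probability of the positive among all candidates, in nats."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max before exponentiating for stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_norm - scaled[positive_index]

n = 4
uninformative = [1.0] * n                 # positive looks like every negative
well_separated = [1.0, 0.0, 0.0, 0.0]     # positive clearly stands out

print(info_nce(uninformative, 0))    # ln(4), about 1.386: no better than guessing
print(info_nce(well_separated, 0))   # close to 0
```

In the contrastive predictive coding literature, ln N minus the expected loss is shown to lower-bound the mutual information between the two views, which is the sense in which lowering this loss raises mutual information.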
Quick check
Which divergence does minimizing cross-entropy loss also minimize when training a classifier?
Information theory provides a unified framework for understanding what neural networks learn. The Information Bottleneck principle suggests that training involves two phases:
1. Fitting phase: Maximize mutual information between data and hidden layers.
2. Compression phase: Compress the representation while preserving task-relevant information.
On this account, deep networks generalize because they learn to forget noisy or irrelevant features, keeping only the information needed for the task.
Information theory also suggests one interpretation of tricks like dropout and batch normalization: they inject noise (explicitly for dropout, through mini-batch statistics for batch normalization) that encourages the network to find robust, compressed representations that remain informative despite perturbations.
Interactive Playground
Build your own distributions and compute all information-theoretic quantities. Switch between modes to explore single distributions, compare two distributions, or see all metrics at once.
Information Theory Playground
Create custom distributions and compute all quantities.
Distribution P: entropy H(P) = 2 bits, uniformity 100%
Quick check
In the playground, if you set two distributions to be identical, what should D_KL(P || Q) equal?
Key Takeaways
Entropy
H(X) measures the expected bits needed to describe outcomes. Maximum for uniform distributions, minimum (zero) for certainty. It's the fundamental limit on compression.
Cross-Entropy
H(P, Q) is the cost of using Q to encode P. In ML, it's the classification loss. Lower cross-entropy means better predictions. Zero extra cost only when model matches reality perfectly.
KL Divergence
D_KL(P || Q) = H(P, Q) − H(P) measures divergence. Asymmetric and always ≥ 0. Penalizes Q for missing probability mass where P is confident. The extra cost beyond optimal compression.
Optimal Coding
Huffman codes come within one bit of the Shannon limit: their average length L satisfies H(X) ≤ L < H(X) + 1 bits per symbol. Shannon's source coding theorem establishes entropy as the theoretical compression limit.
Mutual Information
I(X; Y) measures how much knowing X tells you about Y. Zero if independent, maximum if deterministically related. Symmetric, and used for feature selection and for measuring statistical dependence, including nonlinear dependence that correlation misses.
Deep Learning Connection
Minimizing cross-entropy loss minimizes KL divergence. VAEs use KL regularization. The Information Bottleneck offers an account of why deep networks learn compressed representations. Entropy guides calibration and uncertainty.
Final Checks
Quick check
What is the relationship between entropy and KL divergence?
Quick check
In VAEs, why is D_KL(q_encoder || p_prior) used instead of D_KL(p_prior || q_encoder)?
Quick check
What does it mean if a trained classifier outputs entropy close to zero on a test sample?