Diffusion Models
Master the mathematical foundations and intuitions behind generative diffusion models through interactive visualizations and hands-on experiments.
1. What Are Diffusion Models?
Diffusion models are a class of generative models that learn to create new data by gradually removing noise from random samples. They work by reversing a process that progressively corrupts data.
Core Concept
Imagine a photograph being corrupted by noise until it becomes pure white noise. Now reverse time: can we reconstruct the original photo by iteratively removing the noise? This is the essence of diffusion models.
Advantages
- Theoretically sound: reversible process
- Stable training: easier than GANs
- Flexible conditioning: text, images, etc.
- High quality: state-of-the-art image generation
Challenges
- Slow sampling: many denoising steps are needed per sample
- Computationally expensive: training is costly
- Complex architecture: U-Net backbone
- Hyperparameter tuning: the noise schedule must be chosen carefully
Quick check
What is the main idea behind diffusion models?
2. The Forward Diffusion Process
The forward process progressively adds noise to a clean image over T timesteps. This is a fixed, non-learnable process that forms the foundation of diffusion.
Interactive figure: Forward Diffusion Process. Noise progressively corrupts the clean image over timesteps t=0 to t=T. Panels: Clean Image, With Noise. Readouts: Signal-to-Noise Ratio (dB), Signal Strength, Noise Level (100.0% signal and 0.0% noise at t=0). Schedule: cosine.
Cosine schedules maintain signal visibility longer, spending more timesteps in the useful noise range.
Mathematical Definition
The forward process is defined as: x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε
where x₀ is the clean image, ε is standard Gaussian noise, and ᾱ_t is the cumulative product of alphas up to step t. Higher ᾱ_t means more signal and less noise.
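The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own implementation: the function name `q_sample` and the 8x8 checkerboard test image are assumptions made for the example.

```python
import numpy as np

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
# Hypothetical clean image: an 8x8 checkerboard with values in {-1, +1}.
x0 = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
eps = rng.standard_normal(x0.shape)   # standard Gaussian noise, same shape as x0

for alpha_bar_t in (1.0, 0.5, 0.0):
    x_t = q_sample(x0, alpha_bar_t, eps)
    # Correlation with the clean image falls as alpha_bar_t shrinks.
    corr = np.corrcoef(x0.ravel(), x_t.ravel())[0, 1]
    print(f"alpha_bar={alpha_bar_t:.1f}  correlation with x0: {corr:+.2f}")
```

At ᾱ_t = 1 the output is the clean image, and at ᾱ_t = 0 it is the noise ε itself.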
Key Variables
- α_t
- Fraction of signal variance kept at step t: α_t = 1 - β_t
- β_t
- Variance of the noise added at step t
- ᾱ_t
- Cumulative product: ᾱ_t = ∏ α_i for i = 1..t
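As a rough sketch, the three variables can be computed from a list of β values. The linear range 1e-4 to 0.02 over T = 1000 steps is an assumption borrowed from common DDPM setups, not a value given in this lesson.

```python
import numpy as np

T = 1000
# Assumed beta range (a common DDPM default, not taken from this lesson).
betas = np.linspace(1e-4, 0.02, T)   # beta_t: noise variance added at step t
alphas = 1.0 - betas                 # alpha_t: signal variance kept at step t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t: product of alpha_1 .. alpha_t

# Array index 0 corresponds to timestep t = 1.
for t in (1, 250, 500, 1000):
    print(f"t={t:4d}  beta={betas[t-1]:.5f}  alpha={alphas[t-1]:.5f}  alpha_bar={alpha_bars[t-1]:.5f}")
```

Because every α_t is below 1, ᾱ_t can only shrink as t grows, which is why the signal fades over time.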
Important Property
Given x₀ and the noise ε, x_t is fully determined, so we can compute x_t at any timestep in a single step without iterating through all previous steps. This is crucial for training efficiency, because each training example can use a randomly chosen t.
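To see why the shortcut is valid, the sketch below assumes the standard per-step update x_t = √α_t · x_{t-1} + √β_t · ε_t (not spelled out in this lesson) and tracks how much signal and how much noise variance accumulate after T steps. The β range is again an assumed example.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed example schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

signal_coef = 1.0   # multiplier on x0 after the steps taken so far
noise_var = 0.0     # total variance of accumulated noise so far
for alpha_t, beta_t in zip(alphas, betas):
    # One step scales everything by sqrt(alpha_t), then adds noise of variance beta_t.
    signal_coef *= np.sqrt(alpha_t)
    noise_var = alpha_t * noise_var + beta_t

print("iterated signal coefficient:", signal_coef)
print("closed-form sqrt(alpha_bar_T):", np.sqrt(alpha_bars[-1]))
print("iterated noise variance:", noise_var)
print("closed-form 1 - alpha_bar_T:", 1.0 - alpha_bars[-1])
```

Both routes agree: the signal coefficient equals √ᾱ_T and the noise variance equals 1 - ᾱ_T, which is exactly what the one-step formula uses.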
Quick check
In the forward process, as t increases from 0 to T, what happens?
3. Noise Schedules
How we add noise at each step significantly affects model performance. Different schedules concentrate learning in different regions of the noise spectrum.
Interactive figure: Noise Schedule Comparison (Linear vs Cosine), showing how noise changes at each timestep.
Values at the final timestep T:
- Linear: ᾱ_T = 0.3679, SNR(T) = -2.35 dB
- Cosine: ᾱ_T = 0.0000, SNR(T) = -60.00 dB
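The SNR readouts are consistent with the usual definition SNR = ᾱ_t / (1 - ᾱ_t), expressed in decibels. The sketch below assumes that definition, and also assumes a display floor of -60 dB, which is inferred from the cosine readout rather than stated anywhere in the lesson.

```python
import math

def snr_db(alpha_bar, floor_db=-60.0):
    """SNR in dB for a given alpha_bar, assuming SNR = alpha_bar / (1 - alpha_bar).

    floor_db is an assumed display floor so that alpha_bar = 0 stays finite.
    """
    if alpha_bar <= 0.0:
        return floor_db
    if alpha_bar >= 1.0:
        return math.inf
    return max(floor_db, 10.0 * math.log10(alpha_bar / (1.0 - alpha_bar)))

print(f"{snr_db(0.3679):.2f} dB")   # linear readout in the comparison
print(f"{snr_db(0.0):.2f} dB")      # cosine readout in the comparison
print(f"{snr_db(0.5):.2f} dB")      # equal parts signal and noise
```

An SNR of 0 dB marks the point where signal and noise carry equal variance, and negative values mean noise dominates.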
Key Difference
Cosine spends more time at low noise regions where signal is visible
Why Cosine Schedules Are Better
- Linear spends excessive time in pure noise (t > 0.8T), where the model learns less
- Cosine maintains higher SNR longer, maximizing the useful signal-noise tradeoff
- Cosine empirically produces better quality samples with fewer sampling steps
Linear Schedule
Simple, uniform noise increase, but it spends too much time near pure noise, where learning is inefficient.
Cosine Schedule
Smooth, maintains signal visibility longer. Better empirical results and faster convergence.
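The two schedules can be compared numerically. This sketch assumes the commonly used parameterizations: linear β from 1e-4 to 0.02, and a cosine ᾱ_t of the form cos²((t/T + s)/(1 + s) · π/2) normalized by its value at t = 0, with s = 0.008. These settings are assumptions, so the numbers will not match the widget readouts in this lesson.

```python
import numpy as np

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for linearly increasing beta_t (assumed default range)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t defined directly by a squared-cosine curve (assumed offset s)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]   # drop t = 0 so index 0 is timestep 1

T = 1000
lin = linear_alpha_bars(T)
cos_ab = cosine_alpha_bars(T)
for frac in (0.25, 0.5, 0.75):
    i = int(frac * T) - 1
    print(f"t = {frac:.2f} T   linear alpha_bar = {lin[i]:.4f}   cosine alpha_bar = {cos_ab[i]:.4f}")
```

Under these assumptions the cosine schedule keeps a larger ᾱ_t at each of the three checkpoints, which is the "maintains signal visibility longer" behavior described above.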
Quick check
Why are cosine noise schedules generally preferred over linear schedules?
4. The Reverse Diffusion Process
The reverse process is where the model learns. Given a noisy image, it predicts the noise and removes it. Iterating this process from pure noise produces new samples.
Interactive figure: Reverse Diffusion (Denoising). Step backward through the diffusion process, predicting and removing noise at each step. Panels: Target (Clean), Noisy (Step 50), Noise Pred., Denoised.
Denoising Step
At each step, the model predicts the noise added during the forward process, then removes it.
x_{t-1} = (x_t - β_t/√(1-ᾱ_t) · ε_θ) / √α_t + σ_t · z
Key Insight
The model learns to predict noise, not the image itself. This prediction is used to iteratively refine the sample from pure noise to a coherent output.
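A training step built on this idea might look like the sketch below. It assumes the simplified mean-squared-error objective commonly used for noise prediction. The `model` argument is a hypothetical callable taking (x_t, t), and a dummy model that always predicts zero stands in for a real network.

```python
import numpy as np

def noise_prediction_loss(model, x0, alpha_bars, rng):
    """MSE between the true noise and the model's prediction at random timesteps."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)          # one random timestep index per example
    eps = rng.standard_normal(x0.shape)                       # the noise the model must recover
    a_bar = alpha_bars[t].reshape(batch, *([1] * (x0.ndim - 1)))  # broadcast over image dims
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps    # closed-form forward process
    return np.mean((eps - model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(64, 16, 16))                # hypothetical batch of images

def zero_model(x_t, t):
    # Placeholder for a trained network: always predicts "no noise".
    return np.zeros_like(x_t)

loss = noise_prediction_loss(zero_model, x0, alpha_bars, rng)
print("loss of the zero predictor:", loss)
```

A predictor that always outputs zero scores a loss close to 1, the variance of the noise, so a trained network should land below that baseline.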
The Reverse Step
Instead of learning to predict the image x₀ directly (which is hard), the model learns to predict the noise ε_θ(x_t, t). This prediction guides the denoising step:
x_{t-1} = (x_t - (β_t / √(1 - ᾱ_t)) · ε_θ(x_t, t)) / √α_t + σ_t · z
where z is fresh standard Gaussian noise and σ_t controls how much of it is added back at step t.
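The update rule above translates almost directly into code. In this sketch the function name `reverse_step` and its argument names are illustrative. The sanity check uses the first timestep, where ᾱ_1 = α_1, and feeds in the true noise in place of a model prediction.

```python
import numpy as np

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, z):
    """One denoising step: remove the predicted noise, rescale, then add fresh noise."""
    beta_t = 1.0 - alpha_t
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    return mean + sigma_t * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))    # hypothetical clean sample
eps = rng.standard_normal((4, 4))   # noise used in the forward step
alpha_1 = 0.9                       # assumed value for the first timestep

# Forward step to t = 1, then reverse it with a perfect noise prediction and no added noise.
x1 = np.sqrt(alpha_1) * x0 + np.sqrt(1.0 - alpha_1) * eps
recovered = reverse_step(x1, eps, alpha_t=alpha_1, alpha_bar_t=alpha_1,
                         sigma_t=0.0, z=np.zeros_like(x1))
print("max reconstruction error:", np.abs(recovered - x0).max())
```

With a perfect prediction and σ_t = 0, the step at t = 1 recovers x₀ up to floating-point error. A real model only approximates the noise, which is one reason many small steps are used.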
The U-Net Architecture
The model ε_θ is typically a U-Net with:
- Encoder: downsample with convolutions
- Bottleneck: attention mechanisms
- Decoder: upsample to original resolution
- Skip connections: preserve spatial information
- Time embedding: condition on timestep t
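The lesson does not say how the time embedding is built. One common choice, assumed here, is a sinusoidal embedding in the style of transformer positional encodings. The function name, the `max_period` value, and the even `dim` requirement are all assumptions of this sketch.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps; dim is assumed to be even."""
    half = dim // 2
    # Frequencies spaced geometrically from 1 down toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding([0, 10, 500], dim=8)
print(emb.shape)
print(np.round(emb, 3))
```

In a U-Net, a vector like this is typically passed through a small MLP and added to the feature maps in each block, so every layer knows how noisy its input is.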
Quick check
Why does predicting noise work better than predicting the image directly?
5. The Complete Diffusion Journey
Interactive walkthrough of all 6 steps: from understanding noise, through forward and reverse processes, to sampling and guided generation.
Step 1: What is Noise?
Noise is random variation. In diffusion, we add Gaussian noise progressively to destroy image structure.
Standard Gaussian noise has zero mean and unit variance. Each individual value is unpredictable, but the overall statistics are known exactly.
Figure: Gaussian Noise Example
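These statistics are easy to check empirically. The snippet below draws a 256x256 noise image (an arbitrary size chosen for the example) and measures its mean, variance, and the correlation between horizontally adjacent pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal((256, 256))   # one "image" of pure Gaussian noise

mean = noise.mean()
var = noise.var()
# Correlation between each pixel and its right-hand neighbor.
neighbor_corr = np.corrcoef(noise[:, :-1].ravel(), noise[:, 1:].ravel())[0, 1]

print(f"mean = {mean:+.4f}, variance = {var:.4f}, neighbor correlation = {neighbor_corr:+.4f}")
```

The mean and neighbor correlation come out near 0 and the variance near 1. The near-zero correlation is what "no image structure" means in practice: knowing one pixel tells you nothing about the next.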
Key Takeaway
Diffusion models reverse a noise corruption process. Instead of learning to predict images directly (hard), they learn to predict noise (easier). Repeating this denoising step many times turns pure noise into a high-quality sample.
Quick check
In the sampling process, what is the initial state?
6. Interactive Playground
Pick a pattern and watch it get destroyed by noise, then rebuilt step by step. Experiment with different directions and speeds.
Interactive figure: Diffusion Playground. Panels: Target Pattern, Removing Noise, Timeline. Default state: checkerboard pattern, 80 steps, currently at step 0 of 80 (0% progress).
What's Happening?
The reverse process removes predicted noise iteratively. Starting from pure noise, each step brings the image closer to a coherent sample.
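A full sampling loop can be sketched as follows. Since no trained network is available here, the example uses a hypothetical "oracle" noise predictor that already knows a single checkerboard target. The 80 steps mirror the playground, while the β range and the choice σ_t = √β_t are assumptions.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Start from pure noise and apply the reverse step once per timestep."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # x_T: pure Gaussian noise
    for i in reversed(range(len(betas))):               # array index i is timestep t = i + 1
        eps_pred = eps_model(x, i)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_pred) / np.sqrt(alphas[i])
        # Fresh noise on every step except the last one.
        z = rng.standard_normal(shape) if i > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z                # assumed choice: sigma_t = sqrt(beta_t)
    return x

betas = np.linspace(1e-4, 0.2, 80)                      # assumed schedule for 80 steps
alpha_bars = np.cumprod(1.0 - betas)
target = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0  # 8x8 checkerboard in {-1, +1}

def oracle_eps(x_t, i):
    # Hypothetical perfect predictor for a dataset containing only `target`.
    return (x_t - np.sqrt(alpha_bars[i]) * target) / np.sqrt(1.0 - alpha_bars[i])

result = sample(oracle_eps, target.shape, betas, np.random.default_rng(0))
print("max distance from target:", np.abs(result - target).max())
```

With a perfect predictor the loop lands on the target pattern. A trained network replaces `oracle_eps` and generalizes across many images instead of memorizing one.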
Experiment Ideas
- Watch how different patterns react to noise (checkerboard vs circle)
- Try pausing mid-process: is the image recognizable at step T/2?
- Compare forward vs reverse: are they perfect inverses?
- Notice: the reverse process is non-deterministic (it adds noise at each step)
Quick check
In practice, why do diffusion models need many sampling steps?
Key Takeaways
Reversibility
Diffusion models reverse a noise corruption process. This is mathematically elegant and empirically powerful.
Noise Prediction
Learning to predict noise is easier than predicting images. This is the core insight that makes diffusion work.
Iterative Generation
By iteratively denoising from pure noise, we can generate an effectively unlimited variety of high-quality samples, since each new noise draw leads to a different output.
Conditioning & Guidance
Classifier-free guidance lets us steer generation toward specific outcomes, enabling text-to-image and other conditional tasks.
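In code, classifier-free guidance usually comes down to blending two noise predictions from the same model, one made with the condition and one without. The sketch below uses one common convention for the blend. Other write-ups parameterize the scale differently, so treat the exact formula and the function name as assumptions.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Push the prediction away from the unconditional one, toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 4))   # stand-in for the model output without the prompt
eps_cond = rng.standard_normal((4, 4))     # stand-in for the model output with the prompt

for scale in (0.0, 1.0, 3.0):
    eps_hat = guided_noise(eps_uncond, eps_cond, scale)
    print(f"scale={scale}: distance from conditional = {np.abs(eps_hat - eps_cond).max():.3f}")
```

Under this convention a scale of 0 ignores the condition, 1 uses the conditional prediction unchanged, and larger values exaggerate the difference between the two. The blended prediction then takes the place of ε_θ in the reverse step.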
Schedule Matters
The noise schedule determines how much time is spent learning in different signal-noise regimes. Cosine schedules empirically work better.
Quick check
Which of the following is NOT a key advantage of diffusion models?
Next Steps
- Implement from scratch: Build a simple diffusion model in PyTorch, using NumPy only for math utilities
- Explore architectures: Study how transformers and attention mechanisms improve diffusion model quality
- Try conditioning: Implement classifier-free guidance for text-to-image generation
- Optimize sampling: Learn about distillation and accelerated samplers (DPM-Solver, etc.)
- Read papers: DDPM, DDIM, Classifier-Free Guidance, and Latent Diffusion