Advanced

Diffusion Models

Master the mathematical foundations and intuitions behind generative diffusion models through interactive visualizations and hands-on experiments.

~20 minutes
📚6 sections
🎯5+ knowledge checks

1. What Are Diffusion Models?

Diffusion models are a class of generative models that learn to create new data by gradually removing noise from random samples. They work by reversing a process that progressively corrupts data.

Core Concept

Imagine a photograph being corrupted by noise until it becomes pure white noise. Now reverse time: can we reconstruct the original photo by iteratively removing the noise? This is the essence of diffusion models.

Advantages

  • Theoretically sound: reversible process
  • Stable training: easier than GANs
  • Flexible conditioning: text, images, etc.
  • High quality: SOTA image generation

Challenges

  • Slow sampling: many steps needed
  • Computationally expensive: training
  • Complex architecture: U-Net backbone
  • Hyperparameter tuning: noise schedule

Quick check

What is the main idea behind diffusion models?

2. The Forward Diffusion Process

The forward process progressively adds noise to a clean image over T timesteps. This is a fixed, non-learnable process that forms the foundation of diffusion.

[Interactive figure: forward diffusion process. Noise progressively corrupts the clean image over timesteps t=0 to t=T. Panels: clean image (t=0) and noised image at the selected timestep. Plot: signal-to-noise ratio (dB) against t / T. Readouts at t=0: signal strength 100.0%, noise level 0.0%.]

Using cosine schedule

Cosine schedules maintain signal visibility longer, spending more timesteps in the useful noise range.

Mathematical Definition

The forward process is defined as: x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε

where x₀ is the clean image, ε is standard Gaussian noise, and ᾱ_t is the cumulative product of the alphas up to step t. A higher ᾱ_t means more signal and less noise: at ᾱ_t = 1 the sample is the clean image, and as ᾱ_t approaches 0 it approaches pure noise.

Key Variables

α_t
Fraction of the signal variance kept at step t: α_t = 1 - β_t (a scaling factor, not a probability)
β_t
Variance of the Gaussian noise added at step t
ᾱ_t
Cumulative product: ᾱ_t = ∏ α_i for i = 1..t
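
The three variables can be computed in a few lines. The sketch below is only an illustration: the number of steps (T = 1000) and the linear β range (1e-4 to 0.02) are assumed values, not ones given in this lesson.

```python
import numpy as np

# Assumed values: T = 1000 steps and a linear beta range of 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_t: noise variance added at step t
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = product of alpha_i for i <= t

# Index 0 holds step t = 1, index T - 1 holds step t = T.
print(alpha_bars[0], alpha_bars[T // 2], alpha_bars[-1])
```

Because every α_t is below 1, ᾱ_t can only shrink as t grows, which is why the signal fades over the course of the forward process.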

Important Property

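Given x₀ and a noise sample ε, x_t is available in closed form for any t, so we can jump straight to an arbitrary timestep without iterating through all previous steps. This is crucial for training efficiency, because each training example can use a randomly chosen timestep.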
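
As a rough sketch of this property, the function below draws x_t from x₀ in a single call. The name `q_sample`, the 1-based indexing of t, and the linear schedule are assumptions made for the example.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from x0 in one shot (t is 1-based); returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)          # standard Gaussian noise
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = np.ones((64, 64))           # stand-in "image" with every pixel equal to 1
x_t, eps = q_sample(x0, 500, alpha_bars, rng)
```

No loop over the earlier timesteps appears anywhere: the cumulative product ᾱ_t already summarizes them.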


Quick check

In the forward process, as t increases from 0 to T, what happens?

3. Noise Schedules

How we add noise at each step significantly affects model performance. Different schedules concentrate learning in different regions of the noise spectrum.

[Interactive figure: noise schedule comparison, linear vs cosine, showing how noise changes at each timestep. Two plots of SNR (dB) against t / T, one per schedule.]

Schedule | ᾱ_T    | SNR(T)
Linear   | 0.3679 | -2.35 dB
Cosine   | 0.0000 | -60.00 dB

Key Difference

Cosine spends more time at low noise regions where signal is visible
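
The SNR readouts follow from ᾱ_T if we take SNR = ᾱ_t / (1 - ᾱ_t), the ratio of signal variance to noise variance, expressed in decibels. The sketch below works under that assumption. The `floor` clamp of 1e-6 is a guess at how the visualization avoids taking the logarithm of zero, not something the lesson states.

```python
import math

def snr_db(alpha_bar, floor=1e-6):
    """SNR in decibels for a given alpha_bar; `floor` is an assumed clamp."""
    a = min(max(alpha_bar, floor), 1.0 - floor)
    return 10.0 * math.log10(a / (1.0 - a))

print(round(snr_db(0.3679), 2))   # linear readout
print(round(snr_db(0.0), 2))      # cosine readout
```

Under these assumptions the two calls give values matching the -2.35 dB and -60.00 dB readouts for the linear and cosine schedules.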

Why Cosine Schedules Are Better

  • Linear spends excessive time in pure noise (t > 0.8T) where model learns less
  • Cosine maintains higher SNR longer, maximizing useful signal-noise tradeoff
  • Empirically produces better quality samples with fewer sampling steps

Linear schedules increase β_t uniformly across timesteps, so the sample is already close to pure noise well before t = T and the final steps contribute little to learning. Cosine schedules keep the signal-to-noise ratio higher for longer, concentrating learning in the useful range. The cosine schedule was introduced by Nichol and Dhariwal (Improved DDPM, 2021), who found the linear schedule sub-optimal for low-resolution images.

Linear Schedule

Simple, uniform noise increase. But spends too much time in pure noise where learning is inefficient.

Cosine Schedule

Smooth, maintains signal visibility longer. Better empirical results and faster convergence.
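
To make the comparison concrete, here is a minimal sketch of both schedules. The cosine form and its offset s = 0.008 follow the commonly cited definition from the Improved DDPM paper, and the linear β range is an assumed one, so the numbers it prints will not match the readouts from the lesson's visualization, whose parameters are not stated.

```python
import numpy as np

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t for t = 1..T under the cosine schedule (offset s assumed)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under a linear beta schedule (range assumed)."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

T = 1000
cos_ab, lin_ab = cosine_alpha_bars(T), linear_alpha_bars(T)
for frac in (0.25, 0.5, 0.75):
    i = int(frac * T) - 1
    print(f"t/T={frac}: linear={lin_ab[i]:.3f}  cosine={cos_ab[i]:.3f}")
```

With these settings the cosine ᾱ_t stays above the linear one at each of the three checkpoints, which is the "signal stays visible longer" behavior described above. In practice the per-step β_t derived from the cosine ᾱ_t is usually clipped near t = T to avoid a singularity. This sketch returns ᾱ_t directly and skips that step.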


Quick check

Why are cosine noise schedules generally preferred over linear schedules?

4. The Reverse Diffusion Process

The reverse process is where the model learns. Given a noisy image, it predicts the noise and removes it. Iterating this process from pure noise produces new samples.

[Interactive figure: reverse diffusion (denoising). Step backward through the diffusion process, predicting and removing noise at each step. Panels: target (clean), noisy input (step 50), predicted noise, denoised result.]

Denoising Step

At each step, the model predicts the noise added during the forward process, then removes it.

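x_{t-1} = (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) / √α_t + σ_t · z

where z is fresh standard Gaussian noise and σ_t sets how much of it is re-injected at step t (no noise is added on the final step).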

Key Insight

The model learns to predict noise, not the image itself. This prediction is used to iteratively refine the sample from pure noise to a coherent output.
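
One common way to train such a model is a mean squared error between the noise that was actually added and the model's prediction. The lesson does not spell out the loss, so treat the sketch below as an assumption. `zero_model` is a hypothetical stand-in for a real network, and the linear schedule is assumed as well.

```python
import numpy as np

def noise_prediction_loss(model, x0, t, alpha_bars, rng):
    """Mean squared error between the sampled noise and the model's guess."""
    a_bar = alpha_bars[t - 1]                        # t is 1-based
    eps = rng.standard_normal(x0.shape)              # the noise to recover
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return float(np.mean((eps - model(x_t, t)) ** 2))

# Hypothetical stand-in for a trained network: always predicts zero noise.
def zero_model(x_t, t):
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))
loss = noise_prediction_loss(zero_model, x0, t=300, alpha_bars=alpha_bars, rng=rng)
```

A model that always answers "no noise" scores a loss near 1, the variance of ε. A model that recovers ε exactly scores 0. Training pushes a real network between those two extremes.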


The Reverse Step

Instead of learning to predict the image x₀ directly, the model learns to predict the noise ε_θ(x_t, t). This prediction guides the denoising step.

x_{t-1} = (x_t - (β_t / √(1 - ᾱ_t)) · ε_θ(x_t, t)) / √α_t + σ_t · z

In practice, predicting the noise tends to work better than predicting the image directly, because it leverages the structure of the diffusion process itself. The network only has to answer 'What noise corrupted this image?', and removing that estimate step by step yields denoising and, starting from pure noise, sample generation.
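
The reverse step translates almost directly into code. This is a sketch under assumptions: `reverse_step` is a hypothetical helper name, the schedule is an assumed linear one, and the noise scale is set to σ_t = √β_t, which is one common choice rather than the only one.

```python
import numpy as np

def reverse_step(x_t, eps_pred, t, betas, rng):
    """One denoising step from x_t to x_{t-1}; t is 1-based."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(alpha_t)
    if t == 1:
        return mean                      # no noise is added on the final step
    sigma_t = np.sqrt(beta_t)            # assumed choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
eps = rng.standard_normal(x0.shape)
x1 = np.sqrt(1.0 - betas[0]) * x0 + np.sqrt(betas[0]) * eps   # forward to t = 1
x0_hat = reverse_step(x1, eps, 1, betas, rng)                  # true noise given
```

When the step is handed the true noise at t = 1, `x0_hat` matches `x0` up to floating-point error. A trained network only approximates that noise, which is one reason sampling proceeds in many small steps.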

The U-Net Architecture

The model ε_θ is typically a U-Net with:

  • Encoder: downsample with convolutions
  • Bottleneck: attention mechanisms
  • Decoder: upsample to original resolution
  • Skip connections: preserve spatial information
  • Time embedding: condition on timestep t
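
The lesson does not say how the time embedding is built. A widely used option is a sinusoidal embedding, borrowed from transformer position encodings. The sketch below assumes that choice, and the function name, the even `dim`, and `max_period` are all illustrative.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (N,) to sinusoidal features (N, dim)."""
    half = dim // 2                                   # dim is assumed even
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding([0, 10, 999], dim=128)
print(emb.shape)
```

In a typical U-Net the embedding is passed through a small MLP and added to the feature maps of each block, so every layer knows how noisy its input is.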

Quick check

Why does predicting noise work better than predicting the image directly?

5. The Complete Diffusion Journey

Interactive walkthrough of all 6 steps: from understanding noise, through forward and reverse processes, to sampling and guided generation.


Step 1: What is Noise?

Noise is random variation. In diffusion, we add Gaussian noise progressively to destroy image structure.

Standard Gaussian noise has zero mean and unit variance, with every pixel drawn independently. Individual samples are unpredictable, but their statistics are fully known.
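
A quick way to see this is to draw a few noise images and check their statistics. The 64×64 size and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = [rng.standard_normal((64, 64)) for _ in range(3)]   # three noise images
for i, eps in enumerate(samples, start=1):
    print(f"sample {i}: mean={eps.mean():+.3f}  std={eps.std():.3f}")
```

Each sample looks different pixel by pixel, yet the mean stays close to 0 and the standard deviation close to 1.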


Key Takeaway

Diffusion models reverse a noise corruption process. Instead of learning to predict images directly (hard), they learn to predict noise (easier). Iterating this denoising step from pure noise produces high-quality samples.


Quick check

In the sampling process, what is the initial state?

To condition generation on text or other signals, classifier-free guidance computes two noise predictions at each step: one unconditional and one conditional. We then combine them: ε_guided = ε_uncond + scale · (ε_cond - ε_uncond). A scale of 1 recovers the conditional prediction, and higher scales make the generation follow the condition more strongly, typically at some cost in sample diversity. This is how text-to-image models such as DALL-E 2 and Stable Diffusion steer generation toward a text prompt.
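
The guidance formula is a one-line combination of two arrays. In the sketch below the two predictions are random placeholders, since a real model is out of scope here, and `guided_noise` is a name chosen for the example.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Combine the two predictions as in the formula above."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 4))   # placeholder for model(x_t, t)
eps_cond = rng.standard_normal((4, 4))     # placeholder for model(x_t, t, prompt)
for scale in (0.0, 1.0, 7.5):
    out = guided_noise(eps_uncond, eps_cond, scale)
    print(scale, float(np.abs(out - eps_cond).max()))
```

A scale of 0 ignores the condition and a scale of 1 uses the conditional prediction as is. Larger values extrapolate past the conditional prediction, away from the unconditional one.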

6. Interactive Playground

Pick a pattern and watch it get destroyed by noise, then rebuilt step by step. Experiment with different directions and speeds.

[Interactive playground: choose a target pattern (for example, checkerboard) and step through 80 noise-removal steps, with a progress readout and timeline.]

What's Happening?

The reverse process removes predicted noise iteratively. Starting from pure noise, each step brings the image closer to a coherent sample.
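
The whole loop can be sketched in a few lines. Everything specific here is an assumption for illustration: the linear schedule, the choice σ_t = √β_t, and above all the `oracle` predictor, which cheats by knowing the target pattern and stands in for a trained network. The 80 steps and the checkerboard mirror the playground's settings.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Run the reverse process from pure noise down to t = 0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                    # x_T: pure Gaussian noise
    for t in range(len(betas), 0, -1):                # t = T, T-1, ..., 1
        b, a, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps_pred = predict_noise(x, t)
        x = (x - b / np.sqrt(1.0 - ab) * eps_pred) / np.sqrt(a)
        if t > 1:                                     # skip noise on last step
            x = x + np.sqrt(b) * rng.standard_normal(shape)
    return x

T = 80
betas = np.linspace(1e-4, 0.02, T)                    # assumed schedule
alpha_bars = np.cumprod(1.0 - betas)
target = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)   # checkerboard

def oracle(x_t, t):
    # Toy stand-in for a trained network: it knows the target pattern.
    ab = alpha_bars[t - 1]
    return (x_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

result = sample(oracle, target.shape, betas, np.random.default_rng(0))
```

With the oracle, `result` lands on the checkerboard up to floating-point error. A real model replaces `oracle` with a learned ε_θ, and different random seeds then lead to different samples.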

Experiment Ideas

  • Watch how different patterns react to noise (checkerboard vs circle)
  • Try pausing mid-process: is the image recognizable at step T/2?
  • Compare forward vs reverse: are they perfect inverses?
  • Notice: the reverse process is non-deterministic (it adds noise at each step)

Quick check

In practice, why do diffusion models need many sampling steps?

Key Takeaways

1️⃣

Reversibility

Diffusion models reverse a noise corruption process. This is mathematically elegant and empirically powerful.

2️⃣

Noise Prediction

In practice, learning to predict the noise works better than learning to predict the image directly. This is the core insight that makes diffusion work.

3️⃣

Iterative Generation

By iteratively denoising from pure noise, we can generate an effectively unlimited variety of high-quality samples.

4️⃣

Conditioning & Guidance

Classifier-free guidance lets us steer generation toward specific outcomes, enabling text-to-image and other conditional tasks.

5️⃣

Schedule Matters

The noise schedule determines how much time is spent learning in different signal-noise regimes. Cosine schedules empirically work better.


Quick check

Which of the following is NOT a key advantage of diffusion models?

Next Steps

  • Implement from scratch: Build a simple diffusion model in PyTorch, using NumPy only for math utilities
  • Explore architectures: Study how transformers and attention mechanisms improve diffusion model quality
  • Try conditioning: Implement classifier-free guidance for text-to-image generation
  • Optimize sampling: Learn about distillation and accelerated samplers (DPM-Solver, etc.)
  • Read papers: DDPM, DDIM, Classifier-Free Guidance, and Latent Diffusion
