Diffusion Models

Modern generative engines. Score matching, SDEs, and the transition to latent diffusion and video generation.

Estimated time: ~90 min

Step 1
Denoising Diffusion Probabilistic Models (DDPM)
A generative model that learns to reverse a fixed Gaussian corruption process. Ho et al. (2020) showed that a neural network trained to predict the added noise, using a simple MSE loss at timesteps sampled uniformly from \( 1, \dots, T \), yields high-quality image synthesis, including a state-of-the-art FID on unconditional CIFAR-10 at the time. This noise-prediction recipe underlies most modern image and video diffusion models.
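
The noise-prediction objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the linear schedule values (\( T = 1000 \), \( \beta \) from 1e-4 to 0.02) follow the defaults reported by Ho et al., while the function names and the `eps_model` callable are hypothetical stand-ins for a real network.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form forward sample: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    # Reshape abar_t so it broadcasts over the non-batch dimensions of x0.
    abar = alpha_bars[t].reshape(-1, *([1] * (x0.ndim - 1)))
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

def ddpm_loss(eps_model, x0, alpha_bars, rng):
    """Simple MSE between the true noise and the model's noise prediction."""
    T = len(alpha_bars)
    t = rng.integers(0, T, size=x0.shape[0])   # one uniform timestep per example
    eps = rng.standard_normal(x0.shape)        # the noise the model must recover
    x_t = q_sample(x0, t, alpha_bars, eps)
    return np.mean((eps_model(x_t, t) - eps) ** 2)

# Usage with a placeholder "model" that always predicts zero noise.
rng = np.random.default_rng(0)
_, alpha_bars = make_schedule()
x0 = rng.standard_normal((64, 8))              # toy batch standing in for images
loss = ddpm_loss(lambda x_t, t: np.zeros_like(x_t), x0, alpha_bars, rng)
```

A model that predicts zero noise scores a loss near 1 (the variance of the noise), which gives a useful baseline when checking that a real network is learning.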
Step 2
Score Matching
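An estimation principle (Hyvärinen, 2005) that fits an unnormalised density by matching the model's score \( \mathbf{s}_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) to the data's score. Because the score is a gradient with respect to \( \mathbf{x} \), the normalising constant drops out. Integration by parts then eliminates the unknown data score, leaving the tractable objective \( \mathbb{E}_{p_{\text{data}}}\big[ \tfrac{1}{2} \lVert \mathbf{s}_\theta(\mathbf{x}) \rVert^2 + \operatorname{tr}\big( \nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x}) \big) \big] \), which underlies modern score-based diffusion models.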
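
As a small worked check of the integration-by-parts objective, the sketch below fits a one-dimensional Gaussian by minimising \( \mathbb{E}[\tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x)] \) over a grid, without ever evaluating the data's score. The Gaussian model, the grid ranges, and the function name `ism_objective` are illustrative assumptions; for this model the minimiser should land near the sample mean and variance.

```python
import numpy as np

def ism_objective(x, mu, var):
    """Implicit score matching loss for a 1-D Gaussian model N(mu, var)."""
    score = -(x - mu) / var        # d/dx log p_theta(x)
    dscore = -1.0 / var            # d/dx of the score (trace of the Jacobian in 1-D)
    return np.mean(0.5 * score ** 2 + dscore)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # toy data; true variance is 2.25

# Brute-force grid search keeps the example dependency-free.
mus = np.linspace(0.0, 4.0, 81)
variances = np.linspace(0.5, 5.0, 91)
grid = np.array([[ism_objective(x, m, v) for v in variances] for m in mus])
i, j = np.unravel_index(np.argmin(grid), grid.shape)
mu_hat, var_hat = mus[i], variances[j]
```

In practice the parameters would be found by gradient descent rather than a grid, and the trace term becomes the expensive part in high dimensions.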
Step 3
Denoising Score Matching
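Vincent (2011) showed that learning the score of Gaussian-perturbed data, \( q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{\text{data}}(\mathbf{x})\, d\mathbf{x} \), is equivalent to regressing onto the score of the corruption kernel, which has a closed form: \( \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). Score learning thus reduces to a simple regression: given \( \tilde{\mathbf{x}} \), predict \( (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). This identity is the algorithmic heart of modern diffusion models.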
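
The regression view can be checked numerically on a toy case. In the sketch below, the data are assumed to be zero-mean Gaussian so that the true score of the noisy marginal is known in closed form, and a one-parameter linear score model is fit by least squares to the denoising target. The names `dsm_target`, `a_hat`, and `a_true` are illustrative.

```python
import numpy as np

def dsm_target(x, x_tilde, sigma):
    """Score of the Gaussian corruption kernel: (x - x_tilde) / sigma^2."""
    return (x - x_tilde) / sigma ** 2

rng = np.random.default_rng(0)
data_std, sigma = 2.0, 0.5
x = rng.normal(0.0, data_std, size=200_000)           # clean samples
x_tilde = x + sigma * rng.standard_normal(x.shape)    # Gaussian-corrupted samples

# Fit a linear score model s(x_tilde) = a * x_tilde by least squares on the target.
target = dsm_target(x, x_tilde, sigma)
a_hat = np.sum(x_tilde * target) / np.sum(x_tilde ** 2)

# The noisy marginal is N(0, data_std^2 + sigma^2), whose score is -x_tilde / (data_std^2 + sigma^2).
a_true = -1.0 / (data_std ** 2 + sigma ** 2)
```

Although each individual target is a noisy function of the clean sample, the least-squares fit recovers the slope of the marginal score, which is the point of the identity.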
Step 4
Latent Diffusion
Run the diffusion process in the compressed latent space of a pretrained VAE rather than in pixel space. Latent diffusion (Rombach et al., 2022) sharply reduces memory and compute while preserving sample quality: with the 8× per-side downsampling used by Stable Diffusion v1, a 512×512×3 image becomes a 64×64×4 latent, roughly 48× fewer values for the denoiser to process. It is the architecture behind Stable Diffusion, SDXL, SD3, and many other text-to-image systems.
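
The size reduction is simple arithmetic, sketched below. The defaults (a per-side downsampling factor of 8 and 4 latent channels) are assumptions that mirror the commonly cited Stable Diffusion v1 configuration; other autoencoders use different values, and the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the VAE latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many times fewer values the latent holds than the pixel image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)

shape = latent_shape(512, 512)         # (4, 64, 64)
ratio = compression_ratio(512, 512)    # 48.0
```

The number of spatial positions falls by 64×, which matters most for attention layers whose cost grows quadratically with the number of positions.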
Step 5
Diffusion Transformers (DiT)
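Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer that operates on a sequence of VAE-latent patches. In their experiments, sample quality (FID) improves steadily as model compute increases, and the largest DiT models outperform earlier U-Net diffusion models on class-conditional ImageNet. DiT-style backbones are used in Stable Diffusion 3 (in a multimodal variant), are reported for Sora, and have been adopted by many recent text-to-image and text-to-video diffusion models.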
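
The step that turns a latent into Transformer tokens can be sketched with plain reshapes. This is an illustrative version only: the learned linear embedding and positional encodings that follow in a real model are omitted, the within-token ordering is one possible convention, and `patchify` / `unpatchify` are hypothetical helper names.

```python
import numpy as np

def patchify(z, p):
    """Split a latent of shape (C, H, W) into (H/p * W/p) tokens of size p*p*C."""
    c, h, w = z.shape
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    z = z.reshape(c, h // p, p, w // p, p)
    z = z.transpose(1, 3, 2, 4, 0)             # (H/p, W/p, p, p, C)
    return z.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, c, h, w, p):
    """Inverse of patchify: rebuild the (C, H, W) latent from the token sequence."""
    z = tokens.reshape(h // p, w // p, p, p, c)
    z = z.transpose(4, 0, 2, 1, 3)             # (C, H/p, p, W/p, p)
    return z.reshape(c, h, w)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 32, 32))           # e.g. the latent of a 256x256 image at 8x downsampling
tokens = patchify(z, p=2)                      # 256 tokens, each of dimension 16
z_back = unpatchify(tokens, 4, 32, 32, p=2)
```

Halving the patch size quadruples the sequence length, which is the main knob trading compute for quality in this design.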
Step 6
Classifier-Free Guidance
Classifier-free guidance is a sampling trick for conditional diffusion models that combines conditional and unconditional predictions to push samples harder toward the prompt. It improves prompt adherence without a separate classifier, but too much guidance can oversaturate images and reduce diversity.
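
The combination step itself is a one-line extrapolation. The sketch below uses the convention in which a scale of 1 recovers the plain conditional prediction and a scale of 0 the unconditional one; some papers parameterise the weight differently, so treat the exact form and the name `cfg_combine` as assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Guided noise prediction: move from the unconditional output toward (and past) the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))   # stand-in for the model output with an empty prompt
eps_cond = rng.standard_normal((4, 64, 64))     # stand-in for the model output with the real prompt

eps_plain = cfg_combine(eps_uncond, eps_cond, scale=1.0)    # equals eps_cond
eps_guided = cfg_combine(eps_uncond, eps_cond, scale=7.5)   # extrapolates beyond eps_cond
```

Each sampling step needs both predictions, so guidance roughly doubles the denoiser cost unless the two passes are batched together.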