Diffusion Models Explained¤
-
Progressive Denoising
Learn to reverse a gradual noising process, iteratively refining random noise into coherent data
-
Stable Training
Straightforward MSE objective with no adversarial dynamics—far more stable than GANs
-
State-of-the-Art Quality
Achieves the highest quality generative results, powering DALL·E 2, Stable Diffusion, and Sora
-
Exceptional Controllability
Natural framework for conditional generation, inpainting, editing, and guidance techniques
New here?
For a one-page map of how diffusion fits next to VAEs, GANs, Flows, EBMs, and Autoregressive models, start with Generative Models — A Unified View. This page is the deep-dive on diffusion specifically.
Overview¤
Diffusion models, introduced by Sohl-Dickstein et al. (2015) and made practical by Ho, Jain & Abbeel (2020) — DDPM, are a class of deep generative models that learn to generate data by reversing a gradual noising process. Unlike GANs which learn through adversarial training or VAEs which compress to latent codes, diffusion models systematically destroy data structure through noise addition, then learn to reverse this process for generation.
What makes diffusion models special? They solve the generative modeling challenge through an elegant two-stage process: a fixed forward diffusion that gradually corrupts data into pure noise over many timesteps, and a learned reverse diffusion that progressively denoises random samples into realistic data. This approach offers unprecedented training stability, superior mode coverage, and exceptional sample quality.
The Intuition: From Ink to Water and Back¤
Think of diffusion like watching a drop of ink dissolve in water:
-
The Forward Process is like dropping ink into a glass of water and watching it gradually diffuse. At first, you clearly see the ink drop. Over time, it spreads and mixes until the water appears uniformly tinted—all structure is lost.
-
The Reverse Process is like learning to run this process backwards: starting from uniformly tinted water and gradually reconstructing the original ink drop. This seems impossible by hand, but a neural network can learn the "reverse physics."
-
The Training teaches the network to predict: "Given tinted water at some mixing stage, what did it look like one step earlier?" Repeat this prediction many times, and you recover the original ink drop from fully mixed water.
The critical insight: while the forward diffusion is fixed and simple (just add noise), the reverse process is learned and powerful. The model learns to undo corruption at every noise level, enabling generation from pure random noise.
Mathematical Foundation¤
The Forward Diffusion Process¤
The forward process defines a fixed Markov chain that gradually corrupts data \(x_0\) by adding Gaussian noise over \(T\) timesteps (typically \(T=1000\)):
where \(\beta_t \in (0,1)\) controls the variance of noise added at timestep \(t\). The complete forward chain factors as:
Key property: We can sample \(x_t\) at any arbitrary timestep directly without simulating the full chain. Defining \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\):
This can be reparameterized as:
graph LR
A["x₀<br/>(Clean Data)"] -->|"β₁"| B["x₁<br/>(Slightly Noisy)"]
B -->|"β₂"| C["x₂"]
C -->|"β₃"| D["..."]
D -->|"βₜ"| E["xₜ<br/>(Intermediate)"]
E -->|"..."| F["xₜ<br/>(Pure Noise, t = T)"]
style A fill:#c8e6c9
style F fill:#ffccc9
style E fill:#fff3e0
Intuition: As \(t \to T\), the distribution \(q(x_T | x_0)\) approaches an isotropic Gaussian \(\mathcal{N}(0, \mathbf{I})\), ensuring the endpoint is tractable pure noise. The forward process is designed so that \(\bar{\alpha}_T \approx 0\).
The posterior conditioned on the original data is also Gaussian with tractable parameters:
where:
The Reverse Diffusion Process¤
The reverse process learns to invert the forward diffusion, starting from noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and progressively denoising to data \(x_0\):
The complete generative process:
graph RL
F["xₜ<br/>(Pure Noise)"] -->|"Neural Network<br/>Denoising"| E["xₜ₋₁"]
E -->|"Denoise"| D["..."]
D -->|"Denoise"| C["x₂"]
C -->|"Denoise"| B["x₁"]
B -->|"Denoise"| A["x₀<br/>(Generated Data)"]
style F fill:#ffccc9
style A fill:#c8e6c9
style E fill:#fff3e0
Three Equivalent Parameterizations:
Noise Prediction (most common): The network predicts the noise \(\epsilon\) that was added:
Data Prediction: The network directly predicts the clean image:
Score Prediction: The network predicts the gradient of the log probability:
These three quantities are deterministic affine functions of each other given \(x_t\), \(\bar{\alpha}_t\). From the forward reparameterisation \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\) and the score–noise identity
Tweedie's formula (Efron, 2011) further connects the score to the posterior mean:
so a network trained to predict any one of \(\{\epsilon, x_0, \text{score}\}\) implicitly produces the other two.
Mathematical Equivalence
Predicting noise is equivalent to predicting the score function (and to predicting the clean image), unifying DDPM-style diffusion models with score-based generative modeling. This is also what makes parameterisation choice (ε / x₀ / v) primarily a numerical-conditioning decision, not a modelling one.
The ELBO Derivation¤
Diffusion models are Markovian hierarchical VAEs. The evidence lower bound decomposes as:
For Gaussian posteriors, the KL divergence terms simplify. The key loss term becomes:
Substituting the reparameterization yields:
Ho et al.'s key empirical finding: The simplified objective works better:
This reduces training to simple mean-squared error between predicted and actual noise!
Variance Schedules¤
The noise schedule \(\{\beta_1, \ldots, \beta_T\}\) fundamentally affects training and sampling quality:
Linear Schedule (Ho et al., 2020):
Typically \(\beta_1 = 0.0001\), \(\beta_T = 0.02\). Simple but can add too much noise early.
Cosine Schedule (Nichol & Dhariwal, 2021 — IDDPM):
with \(s = 0.008\). Provides smoother transitions and empirically superior performance.
Schedule Selection
The cosine schedule has become the de facto standard due to its superior empirical performance. It provides more balanced denoising across timesteps and avoids adding excessive noise in early steps.
Score-Based Perspective¤
The score function \(\nabla_x \log p(x)\) points toward regions of higher probability density. Score-based models (Song & Ermon, 2019 — NCSN) train a network \(s_\theta(x, t)\) to approximate this gradient field through denoising score matching (Vincent, 2011):
Given the learned score, generation proceeds via Langevin dynamics:
The connection to diffusion: the score equals the negative scaled noise.
Stochastic Differential Equations¤
The continuous-time formulation (Song et al., 2021 — ICLR Outstanding Paper) generalizes discrete diffusion as an SDE:
Variance Preserving (VP) SDE corresponds to DDPM:
The reverse-time SDE enables generation:
There exists an equivalent probability flow ODE:
This ODE formulation enables exact likelihood computation and deterministic sampling.
Architecture Design¤
The diffusion backbone is the network \(\epsilon_\theta(x_t, t, c)\) that predicts noise / score / data given a noisy input, a timestep, and (optionally) a conditioning signal. The 2020–2022 generation of diffusion models was dominated by U-Nets; from 2023 onward, transformer backbones (DiT, MMDiT, and their many descendants) have largely taken over for new large-scale models, with state-space and linear-attention alternatives now in active competition. This section walks through both, in roughly chronological order.
U-Net Backbone with Skip Connections¤
The U-Net architecture (Ronneberger et al., 2015) dominates classical diffusion models through its encoder-decoder structure with skip connections; the diffusion-specific adaptation is due to Ho et al. (2020) (DDPM) and refined in Nichol & Dhariwal (2021) (IDDPM) and Dhariwal & Nichol (2021) — ADM:
graph TB
subgraph "Encoder (Downsampling)"
A["Input Image<br/>256×256×3"] --> B["Conv + ResBlock<br/>128×128×128"]
B --> C["Conv + ResBlock<br/>64×64×256"]
C --> D["Conv + ResBlock<br/>32×32×512"]
D --> E["Bottleneck<br/>16×16×1024"]
end
subgraph "Decoder (Upsampling)"
E --> F["Upsample + ResBlock<br/>32×32×512"]
F --> G["Upsample + ResBlock<br/>64×64×256"]
G --> H["Upsample + ResBlock<br/>128×128×128"]
H --> I["Output<br/>256×256×3"]
end
D -.->|"Skip Connection"| F
C -.->|"Skip Connection"| G
B -.->|"Skip Connection"| H
style E fill:#fff3e0
style A fill:#e1f5ff
style I fill:#c8e6c9
Key Components:
- Contracting path: Progressive downsampling (256→128→64→32→16) while increasing channels (the diagram above; production U-Nets often go one or two levels deeper).
- Expanding path: Upsampling reconstructs output at original resolution
- Skip connections: Critical for propagating spatial details lost in bottleneck
- ResNet blocks: \(\text{output} = \text{input} + F(\text{input}, \text{time\_emb})\)
- Group normalization: Dividing channels into groups (~32) for stability
Why U-Net for Diffusion?
- Input and output have identical dimensions (essential for iterative refinement)
- Skip connections preserve fine details through bottleneck
- Multi-scale processing captures both coarse structure and fine texture
- No information bottleneck—maintains full spatial information
Production users: SD 1.x, SD 2.x, Imagen, DALL·E 2's decoder, ControlNet, and most pre-2023 open checkpoints. As of 2024, the SDXL U-Net (2.6 B params) is roughly the upper end of practical U-Net scaling; new large-scale models have moved to transformer backbones.
Beyond U-Net: Modern Architecture Alternatives (2023–2026)¤
By 2023, three pressures pushed the field beyond pure U-Nets: scaling (U-Nets plateau past a few billion parameters), multimodality (text + image streams want first-class joint attention), and efficiency at high resolution (quadratic self-attention dominates the budget at 1024² and above). The result is a small zoo of transformer-, state-space-, and hybrid-attention backbones. Three influential surveys / blogposts collect the lineage: From U-Nets to DiTs (ICLR Blogposts 2026), Efficient Diffusion Models: A Comprehensive Survey (2024) and Efficient Diffusion Models: A Survey (2025).
Diffusion Transformer (DiT)¤
DiT (Peebles & Xie, 2023 — ICCV) replaces the U-Net entirely with a Vision-Transformer-style stack:
- Patchification: the noisy latent \(x_t\) is cut into patches (typically 2×2) and linearly projected to tokens.
- Standard transformer blocks (LayerNorm → MHSA → MLP) operate on the token sequence.
- adaLN-Zero conditioning (see below) injects the timestep + class-label embedding.
- Final unpatchify projects the token sequence back to the latent grid.
Headline finding: scaling DiT compute (Gflops) reliably reduces FID, with no sign of a saturation point through DiT-XL/2 (675 M params). DiT-XL/2 reaches FID 2.27 on ImageNet 256² without classifier-free guidance, the first transformer to clearly beat its U-Net contemporaries on this benchmark.
Influence: DiT is the architectural ancestor of Sora, Stable Diffusion 3, and FLUX.1.
graph TD
A["Latent x_t<br/>H×W×C"] --> B["Patchify<br/>2×2"]
B --> C["Token sequence<br/>N×D"]
C --> D["Transformer block × L<br/>(adaLN-Zero conditioning)"]
T["Timestep t<br/>+ class y"] --> D
D --> E["Unpatchify"]
E --> F["Predicted noise / score<br/>H×W×C"]
style D fill:#e1f5ff
style T fill:#fff9c4
adaLN, adaLN-Zero, adaLN-single, adaLN-LoRA — Conditioning Variants¤
How a transformer-style diffusion backbone is conditioned on the timestep (and optionally class label / text vector / pooled prompt) is itself a design axis with surprising consequences:
- adaLN (Perez et al., 2018 → DiT): replace LayerNorm's affine parameters with \((\gamma, \beta)\) that are learned linear functions of the conditioning embedding. Each transformer block has its own \((\gamma, \beta)\) — much more expressive than feeding the timestep as an extra token.
- adaLN-Zero (Peebles & Xie, 2023): zero-initialise the affine projection so that each DiT block starts as the identity function; the network gradually learns the modulation. Empirically halves FID at matched compute versus vanilla adaLN, and is now the default for class-conditional DiTs.
- adaLN-single (Chen et al., 2023 — PixArt-α): share one set of adaLN parameters across all blocks instead of per-block, with a small per-block residual. Cuts ~20 % of parameters and still trains stably.
- adaLN-LoRA (Lu et al., 2024 — FiTv2): factorise the per-block adaLN projection through a low-rank decomposition. Lighter than adaLN, more expressive than adaLN-single.
Unveiling the Secret of AdaLN-Zero (Wei et al., 2024) gives a careful theoretical analysis of why the zero-init form converges so much faster.
U-ViT — Long Skip Connections in a Transformer¤
U-ViT (Bao et al., 2023 — CVPR) is the bridge between U-Nets and DiTs: it processes patch tokens with a standard transformer, but adds long skip connections from early to late blocks (the transformer analogue of U-Net's symmetric encoder–decoder skips). The paper shows the long skips are essential — without them quality collapses — clarifying that what U-Nets bring to diffusion is not the convolution but the symmetric multi-scale skip structure. FID 2.29 on ImageNet 256², competitive with DiT.
PixArt-α, PixArt-Σ — Efficient Text-Conditional DiTs¤
PixArt-α (Chen et al., 2023) is the first DiT to demonstrate that a small, training-efficient DiT can match Stable Diffusion at text-to-image:
- Cross-attention to T5 text features after each self-attention block.
- adaLN-single for cheap timestep conditioning.
- Three-stage training curriculum (pixel dependency → text alignment → high resolution).
- Trained for ~675 A100-days — about 12 % of SD 1.5's compute budget — and matches SDXL on aesthetic FID.
PixArt-Σ (Chen et al., 2024) scales to direct 4K-resolution generation via a weak-to-strong training schedule and a token-compression block that linearises attention over high-resolution patches.
MMDiT (SD3) — Multimodal Dual-Stream Transformer¤
MMDiT (Esser et al., 2024 — Stable Diffusion 3) is the architectural innovation behind SD3 / SD3.5 and the closest cousin of FLUX.1:
- Two parallel token streams — one for text, one for image — each with its own QKV/MLP weights.
- Joint self-attention across the concatenated streams: every text token can attend to every image token and vice versa, in both directions.
- adaLN-Zero conditioning on a pooled text/timestep vector.
graph TB
T["Text tokens<br/>(T5 + CLIP-G)"] --> TS["Text stream<br/>(QKV_text, MLP_text)"]
I["Noisy image patches"] --> IS["Image stream<br/>(QKV_image, MLP_image)"]
TS --> J["Joint self-attention<br/>(over concat[text, image])"]
IS --> J
J --> TS2["Updated text tokens"]
J --> IS2["Updated image tokens"]
style J fill:#e1f5ff
style TS fill:#fff9c4
style IS fill:#fff3e0
The dual-stream design dramatically outperforms cross-attention on prompt following and text rendering (the classic "spell-the-word-correctly" test). FLUX.1 (Black Forest Labs, 2024) extends this with a hybrid of "double-stream" MMDiT blocks early and cheaper "single-stream" parallel-attention blocks late, plus rotary positional embeddings throughout.
Hourglass Diffusion Transformer (HDiT)¤
HDiT (Crowson et al., 2024 — ICML) reintroduces the U-Net's multi-level hierarchy inside a pure-transformer architecture:
- Hierarchical encoder–decoder with symmetric token-resolution stages (2× downsample / upsample between stages, like U-Net).
- Neighborhood Attention at high resolutions (linear in pixel count) and standard global self-attention at the bottleneck.
- Scales linearly with pixel count rather than quadratically.
This makes HDiT one of the few transformer backbones that trains efficiently in pixel space at 1024² — without a VAE codec, multiscale loss, or self-conditioning. Sets a new SOTA for diffusion on FFHQ-1024² while remaining competitive on ImageNet-256².
SiT — Scalable Interpolant Transformers¤
SiT (Ma et al., 2024 — ECCV) keeps DiT's exact architecture but replaces the discrete-time DDPM training objective with the continuous stochastic-interpolants framework — a generalisation of flow matching that decouples interpolant choice, prediction target, and sampler. The paper systematically ablates each axis and shows SiT-XL/2 reaches FID 2.06 on ImageNet 256² and 2.62 on 512², both improvements over DiT-XL/2 at the same parameter count and FLOPs. SiT is the backbone REPA (see below) is built on top of.
FiT / FiTv2 — Flexible-Resolution DiTs¤
FiT (Lu et al., 2024 — ICML Spotlight) and FiTv2 (Lu et al., 2024) treat images as variable-length token sequences, not static-resolution grids:
- 2-D RoPE (Su et al., 2024) instead of learned position embeddings → genuine generalisation to unseen resolutions.
- SwiGLU MLPs and Q-K vector normalisation for training stability at scale.
- Masked MHSA to handle variable-length padding cleanly.
- FiTv2 adds adaLN-LoRA, a rectified-flow scheduler, and a Logit-Normal sampler, doubling FiT's convergence speed.
The practical payoff is that one FiT checkpoint can generate at arbitrary aspect ratios (portrait, landscape, square) without the cropping artefacts of fixed-grid DiTs.
Hunyuan-DiT — Multi-Resolution Bilingual Text-to-Image¤
Hunyuan-DiT (Tencent, 2024) is a 1.5 B-parameter DiT that interleaves PixArt-α-style single-stream blocks (cross-attention to text) with SD3-style double-stream blocks (joint attention). It is engineered for bilingual (English + Chinese) prompts, with a custom text encoder pipeline and rotary positional encoding. A blueprint that several closed Chinese labs have since extended.
Lumina-Next, DiT-Air — Parameter-Efficient Multi-Modal DiTs¤
- Lumina-Next (Alpha-VLLM, 2024) — a "Next-DiT" backbone that unifies image, video, audio, and 3D generation behind a single MMDiT-style transformer with rectified-flow training; deployed in Lumina-Image 2.0 (March 2025).
- DiT-Air (Chen et al., 2025) — applies layer-wise parameter sharing to MMDiT, achieving a 66 % size reduction at near-equal quality. Currently the most parameter-efficient member of the MMDiT family.
- E-MMDiT (2025) — pushes MMDiT to a 304 M-parameter lightweight variant via a highly compressive tokenizer plus Alternating Subregion Attention.
Mixture-of-Experts Diffusion: DiT-MoE, EC-DiT, Switch-DiT¤
Sparse Mixture-of-Experts has finally crossed over from LLMs into diffusion:
- DiT-MoE (Fei et al., 2024) — scales to 16.5 B parameters with shared-expert routing and an expert-balance loss; FID 1.80 on ImageNet 512² while activating only 3.1 B parameters per token. The first sparse-diffusion model to break under FID 2 at this scale.
- Switch-DiT (Park et al., 2024 — ECCV) — synergises denoising-task experts with a sparse routing layer; different timesteps select different experts.
- EC-DiT (Apple, 2024) — adaptive expert-choice routing where experts pick which tokens to process, balancing expert load implicitly.
A 2025 study (Expert Specialization Analysis, 2024) finds that diffusion experts naturally specialise by spatial position and timestep — not by class label — providing a clean argument for why MoE is a particularly good fit for the iterative-denoising structure of diffusion.
State-Space and Linear-Attention Backbones (2024–2026)¤
Quadratic self-attention is the bottleneck at high resolution. Three lines of work replace it:
- Diffusion Mamba (DiM) (Teng et al., 2024) — Mamba selective state-space layers in place of attention, \(O(N)\) in token count; competitive at 1024² with far less compute than DiT.
- DiMSUM (Phung et al., 2024 — NeurIPS) — adds wavelet-domain processing to Diffusion Mamba, capturing local structure and long-range frequency relations that pure SSMs miss.
- U-Shape Mamba (USM) (Ergasti et al., 2025 — CVPRW) — wraps Mamba blocks in a U-shaped hierarchy, halving sequence length per encoder stage.
- DiG (Zhu et al., 2025 — CVPR) — Diffusion Gated Linear Attention; 1.8× faster than Flash-DiT-XL/2 at 2048² while matching ImageNet-512² FID.
- EDiT / MM-EDiT (Suresh et al., 2025) — linear compressed attention for image-to-image dependencies, plus standard softmax attention only for image-text interactions; linear in resolution at production quality.
Hybrid backbones (linear + softmax attention mixed in fixed ratios, e.g. 3:1) have been the dominant 2026 LLM trend and are now appearing in diffusion as well — see Ant Group's ICLR 2026 expo talk on trillion-scale hybrid attention.
Dynamic / Sparse / Cached DiTs¤
A complementary 2024–2026 thread leaves the architecture roughly fixed but skips computation adaptively:
- DyDiT / DyDiT++ (Zhao et al., 2024–2025) — adjusts compute along both timestep and spatial dimensions; cuts DiT-XL FLOPs by 51 % at competitive FID.
- Learning-to-Cache (Ma et al., 2024 — NeurIPS) — caches static layer activations across denoising steps; ~40 % faster inference for free.
- DiffSparse (ICLR 2026) — differentiable layer-wise token sparsity with a learnable router and DP solver; further accelerates DiTs without retraining the base model.
- TeaCache — feature caching baked into multiple production stacks (Lumina-Image 2.0, FLUX) for ~2× inference speedup.
Autoregressive Challengers (2024) — VAR and MAR¤
A small-but-significant 2024 thread uses autoregression over discrete or continuous visual tokens to challenge diffusion on its own benchmarks:
- VAR — Visual Autoregressive Modeling (Tian et al., 2024 — NeurIPS Best Paper) — next-scale prediction over multi-resolution VQ codes. Beats DiT on ImageNet-256² (FID 1.73 vs DiT-XL/2's 2.27) while being ~20× faster at inference.
- MAR — Masked Autoregressive Modeling (Li et al., 2024) — autoregression without vector quantisation; uses a small per-token diffusion head over continuous values. Matches diffusion quality with autoregressive sampling.
- VAR-Image / VAR-Video follow-ups extend next-scale prediction to higher resolutions and video.
These are not diffusion models, but they pressure-test the assumption that multi-step denoising is the only path to high-quality image generation; the unified picture in 2026 is that DiT, MAR, and VAR are closer to each other than to U-Net DDPM.
Choosing a Backbone: a 2026 Cheat-Sheet¤
| Goal | Recommended backbone | Notes |
|---|---|---|
| Small-scale class-conditional ImageNet | DiT / SiT + adaLN-Zero | Cleanest scaling laws; adopt SiT objective for an easy quality bump. |
| Large-scale text-to-image, prompt-following critical | MMDiT (SD3-style) or FLUX.1-style hybrid | Dual-stream attention is decisive for text rendering. |
| Pixel-space, 1024²+ | HDiT | Linear scaling; no VAE needed. |
| Variable aspect ratios | FiT / FiTv2 | 2-D RoPE generalises cleanly; no cropping. |
| Maximum scale (10B+) | DiT-MoE / EC-DiT | Sparsity is the only way past the dense 8B wall in 2026. |
| Long-sequence / high-res efficiency | DiG, DiM / DiMSUM, EDiT | Linear-attention or SSM backbones beat softmax above ~2048 tokens. |
| Production fast inference | DiT/MMDiT + Learning-to-Cache + DyDiT + distillation | Architecture is half the story; the other half is in Recent Advances → Few-Step Distillation. |
| Bilingual (zh/en) text-to-image | Hunyuan-DiT | Battle-tested for Chinese text rendering. |
| Unified multimodal (image + video + audio) | Lumina-Next / Lumina-Image 2.0 | Single backbone across modalities; rectified-flow training. |
The unifying thread: the transformer block (with adaLN-Zero conditioning, RoPE positions, and sometimes long skip connections) has replaced the convolutional U-Net as the default unit of computation, while orthogonal axes — sparsity, linear attention, multi-resolution hierarchy, dual-stream multimodal — combine freely on top.
Time / Noise-Level Conditioning¤
Denoising behaviour is strongly noise-level dependent: at high noise the network's job is to recover global structure, at low noise it sharpens fine texture. The same set of weights has to switch between these regimes smoothly, so the timestep \(t\) (or equivalently the noise scale \(\sigma\) or log-SNR \(\lambda = \log(\bar\alpha_t / (1-\bar\alpha_t))\)) must be fed to every denoising block, not just the input.
There are two distinct sub-problems here, often conflated in implementations:
- Encoding — turn the scalar \(t\) (or \(\sigma\), \(\lambda\)) into a vector \(\tau \in \mathbb{R}^{d_\tau}\) that the network can consume.
- Injection — make \(\tau\) influence every layer's computation.
This subsection walks through both.
1. Encoding the Scalar¤
Sinusoidal Positional Embedding (DDPM Default)¤
The classical choice, inherited from Transformer position encodings (Vaswani et al., 2017):
for \(i = 0, \ldots, d/2 - 1\) and embedding dimension \(d\) (typically 128–256). The fixed log-spaced frequencies make \(\tau(t)\) uniquely identify \(t\) over a long range and provide multi-scale information. Used by DDPM, IDDPM, ADM, all SD 1.x / 2.x checkpoints, and most U-Net diffusion code in the wild.
Random Fourier Features¤
Tancik et al. (2020 — NeurIPS) showed that passing scalar inputs through a random Fourier feature map dramatically improves a network's ability to fit high-frequency functions:
The bandwidth \(\sigma_B\) controls the spectrum of frequencies the network sees — higher \(\sigma_B\) exposes higher-frequency variation in \(t\). NCSN++ (Song et al., 2021), the EDM family (Karras et al., 2022), and most score-based-SDE codebases use random Fourier features in place of fixed sinusoids; the empirical gain is most visible in continuous-time models with very wide \(\sigma\) ranges where the fixed-base sinusoidal frequencies under-resolve the low-noise tail.
Log-SNR / Log-σ Encoding¤
A second axis is what scalar to encode. Three common choices:
- Discrete index \(t \in \{1, \ldots, T\}\) — DDPM's default; simple but tied to a specific schedule.
- Continuous time \(t \in [0, 1]\) — used by score-based SDEs and rectified flow; portable across schedules.
- Log-SNR \(\lambda = \log(\bar\alpha_t / (1-\bar\alpha_t))\) (Kingma et al., 2021 — VDM; Salimans & Ho, 2022) — the "natural coordinate" of diffusion. Log-SNR varies linearly with the network's effective task difficulty, so feeding it directly (rather than \(t\)) is schedule-invariant and numerically stable across very wide noise ranges.
- EDM σ-conditioning (Karras et al., 2022) — feed \(c_{\text{noise}}(\sigma) = \tfrac{1}{4}\log \sigma\) through random Fourier features. The \(\tfrac14 \log\) rescaling keeps the embedding's dynamic range stable across the EDM sampler's geometric \(\sigma\) schedule.
In practice, rectified-flow / flow-matching models (SD3, FLUX.1, modern video DiTs) feed the continuous time \(t \in [0, 1]\) through standard sinusoidal encoding; EDM-line Karras models feed \(\log\sigma\) through random Fourier features; U-Net DDPM models feed the discrete step \(t\) through standard sinusoidal encoding. All three work — what matters is that the chosen scalar varies smoothly with the task difficulty the network actually faces.
Learned-Table (Embedding-Lookup) Encoding¤
If you commit to a fixed discrete schedule of \(T\) timesteps, you can simply learn a \(T \times d\) embedding table:
Used in some early DDPM variants and conditional models. Cheap and expressive, but cannot generalise to unseen timesteps (e.g. you cannot DDIM-skip-sample with a learned-table model trained at \(T=1000\) unless the chosen sub-schedule was seen during training). Mostly displaced by sinusoidal / Fourier encodings, which interpolate naturally.
Other Encodings (Briefly)¤
- Multi-resolution hash encodings (Müller et al., 2022 — Instant-NGP) are popular in NeRFs but have not been broadly adopted for time conditioning in diffusion (the scalar input is too low-dimensional to benefit).
- Hybrid sinusoidal + learned-bias — sinusoidal frequencies plus a small learned offset table per timestep; sometimes seen in distilled / few-step student models.
2. Injecting the Embedding¤
The encoded vector \(\tau(t)\) is typically fed through a 2-layer MLP and then injected into every transformer / U-Net block. The five common patterns:
FiLM (Feature-wise Linear Modulation) — U-Net Default¤
with per-channel scale \(\gamma\) and shift \(\beta\) produced from \(\tau\) via a small MLP. Applied immediately after the GroupNorm in each ResBlock. The dominant choice in U-Net diffusion (DDPM, ADM, SD 1.x / 2.x, Imagen). Compute-cheap and very effective.
adaLN-Zero — Transformer Default¤
For DiT-style backbones, the equivalent is Adaptive Layer Normalisation, which replaces every LayerNorm's affine parameters with a \(\tau\)-dependent \((\gamma, \beta)\) produced by a per-block MLP. adaLN-Zero (Peebles & Xie, 2023) zero-initialises the MLP so each block starts as the identity function and gradually learns its modulation — empirically halves FID at matched compute versus naive LayerNorm. Variants:
- adaLN-single (Chen et al., 2023 — PixArt-α) — share one \((\gamma, \beta)\) projection across all blocks; ~20 % parameter savings.
- adaLN-LoRA (Lu et al., 2024 — FiTv2) — factorise the per-block projection through a low-rank decomposition.
See Beyond U-Net → adaLN, adaLN-Zero, adaLN-single, adaLN-LoRA for the full discussion.
Concatenation as Extra Channels¤
The simplest scheme: tile \(\tau\) spatially and concatenate it to the input image as extra channels. Used by the original DDPM paper and a few early variants. Cheap, but only conditions the first layer, leaving deeper layers to propagate timestep information through learned features — empirically much weaker than FiLM / adaLN.
Time Token Prepending (DiT Variants)¤
Some early DiT variants tokenise \(\tau\) as an extra "[time]" sequence element and let standard self-attention spread the conditioning across all image tokens. Less popular today than adaLN-Zero, which conditions every layer with no extra tokens.
Cross-Attention to a Conditioning Sequence¤
For models where time and other conditions (text, class, pooled CLIP) are tightly coupled, the entire conditioning vector — including \(\tau\) — is fed as the key/value sequence in a cross-attention layer. The MMDiT recipe in SD3 and FLUX.1 adds the timestep embedding to the pooled-text embedding before adaLN, then uses joint self-attention across image and text tokens; this combines the strengths of FiLM-style modulation and full attention-based text-conditioning.
3. Practical Considerations¤
Joint timestep + class / text conditioning. When training a class-conditional or text-conditional diffusion model, the conditioning vector is typically built as \(c = \tau(t) + \text{embed}(y)\) (for class labels) or \(c = \tau(t) + \text{pool}(\text{CLIP}(y))\) (for text). The same FiLM / adaLN injection then carries both signals.
Null embedding for classifier-free guidance. CFG requires the network to also handle the unconditional case during training. The standard implementation reserves a learned null embedding \(\emptyset\) that replaces the class / text conditioning with probability ~10 %, while the timestep embedding is still passed through normally.
EMA of the embedding MLP. The MLP that produces \(\tau\) is small (a few thousand parameters) and is typically EMA-averaged along with the rest of the network — no special handling needed.
4. A Recent Surprise: Is Noise Conditioning Even Necessary?¤
A 2025 line of work pushes back on the orthodoxy that timestep conditioning is essential:
- Sun et al. (2025 — ICML) — Is Noise Conditioning Necessary for Denoising Generative Models? — empirically demonstrates that across a wide range of diffusion-style models, removing time conditioning entirely produces only a small quality gap; the noisy input itself implicitly encodes the noise level for the network to read off.
- Follow-ups show similar results for graph diffusion (NeurIPS 2025) and disjoint-manifold diffusion (2026).
The takeaway is not that timestep conditioning is useless — adding it still wins in most settings — but that the gap is much smaller than one might assume, and the exact encoding (sinusoidal vs Fourier vs log-σ) usually matters less than the injection (FiLM / adaLN-Zero) and the what-scalar (t vs σ vs log-SNR) decisions above.
5. A 2026 Practical Cheat-Sheet¤
| Setting | Scalar | Encoding | Injection |
|---|---|---|---|
| U-Net DDPM (SD 1.x style) | discrete \(t\) | sinusoidal | FiLM in ResBlocks |
| Continuous-time score-based SDE / NCSN++ | \(t \in [0,1]\) or \(\sigma\) | random Fourier features | FiLM |
| EDM-line Karras models | \(\sigma\) via \(c_{\text{noise}}(\sigma) = \tfrac14 \log\sigma\) | random Fourier | FiLM with EDM preconditioning |
| Variational Diffusion Models | log-SNR \(\lambda\) | sinusoidal of \(\lambda\) | FiLM |
| DiT class-conditional | discrete \(t\) | sinusoidal + 2-layer MLP | adaLN-Zero |
| PixArt-α / -Σ | discrete \(t\) | sinusoidal | adaLN-single |
| MMDiT (SD3 / FLUX.1) | continuous \(t\) | sinusoidal added to pooled text | adaLN-Zero + joint attention |
| FiTv2 | continuous \(t\) | sinusoidal | adaLN-LoRA |
| Distilled few-step student | inherited from teacher | inherited | inherited; sometimes one fixed timestep + null embedding |
The unifying picture: pick a scalar that varies smoothly with task difficulty (log-SNR or σ if your range is wide, \(t\) otherwise), encode it with sinusoidal or random-Fourier features into ~256 dimensions, and inject through FiLM for U-Nets / adaLN-Zero for transformers. That recipe accounts for ≥ 95 % of production diffusion models in 2026.
Attention Mechanisms¤
In a modern DiT, attention is the dominant compute cost — typically 60–80 % of the FLOPs, and a much larger fraction of the wall-clock time at high resolution. The attention design space has therefore exploded: there are now ten or so distinct attention patterns coexisting in production diffusion stacks, each making a different trade-off between expressivity, complexity, and hardware fit.
Self-Attention¤
Every spatial token attends to every other spatial token via standard scaled-dot-product attention (Vaswani et al., 2017):
In multi-head form, the projection is split into \(h\) heads of dimension \(d_k / h\), attended in parallel, then concatenated and re-projected. Different heads learn to focus on different spatial relationships (long-range structure, local texture, object boundaries).
Complexity is \(O(N^2 d)\) in the number of tokens \(N\). For a 256² latent at f8 patch-2 (i.e. 32² latent → 16² patches), \(N = 256\) and self-attention is cheap; for a 1024² pixel-space input, \(N \sim 65{,}536\) and full self-attention is intractable. The classical U-Net heuristic: only apply self-attention at the bottleneck and lower-resolution stages (typically 32², 16², 8²), rely on convolutions at higher resolutions. Modern DiTs apply self-attention at every layer — patchification keeps \(N\) small enough — but this then makes attention's \(O(N^2)\) cost the binding constraint at high resolution, motivating most of the variants below.
Cross-Attention (for Text Conditioning)¤
Cross-attention is the standard mechanism for text-to-image conditioning in U-Net-era diffusion (Stable Diffusion 1.x / 2.x, ControlNet, IP-Adapter, etc.):
- Queries \(Q\) come from image / patch features.
- Keys \(K\) and Values \(V\) come from a text encoder's output (CLIP, T5, OpenCLIP-G, T5-XXL).
Different image regions thereby attend to different text tokens, giving spatially-localised text control.
graph LR
subgraph "Image stream"
A["Image / patch features"] --> B["Query Q"]
end
subgraph "Text conditioning"
C["Text embeddings<br/>(CLIP / T5)"] --> D["Keys K"]
C --> E["Values V"]
end
B --> F["Cross-Attention"]
D --> F
E --> F
F --> G["Conditioned features"]
style A fill:#e1f5ff
style C fill:#fff9c4
style G fill:#c8e6c9
The text-side \(K, V\) are shared across denoising steps and across image positions, so cross-attention adds only \(O(N \cdot M \cdot d)\) extra cost where \(M\) is the (typically short, ≤ 77 or ≤ 256) text-token count.
Cross-attention is also how PixArt-α / -Σ inject text into a DiT, where it sits next to every self-attention block and is conditioned on a frozen T5-XXL (Chen et al., 2023).
Joint Attention — MMDiT-Style Multimodal¤
Stable Diffusion 3, FLUX.1, Hunyuan-DiT, Lumina-Next, and most other 2024–2026 large-scale text-to-image models use joint attention instead of cross-attention: text and image tokens are concatenated into a single sequence and run through a single self-attention block, with each modality having its own QKV/MLP weights but a shared attention computation.
This lets every text token attend to every image token and vice versa, in a single bidirectional operation. See the MMDiT subsection above for the architecture diagram. Empirically, joint attention is the single biggest reason MMDiT-style models render text inside images so much better than cross-attention U-Nets.
Position Encodings: Sinusoidal → Learned → RoPE → 2-D RoPE¤
Self-attention is position-agnostic on its own — Q-K dot products are invariant to token reordering. The positional information that distinguishes "top-left patch" from "bottom-right patch" comes from the position encoding. Modern diffusion has converged on two choices:
- Learned absolute position embeddings (one vector per token slot) — simple but tied to a fixed grid; doesn't generalise to other resolutions.
- Rotary Position Embeddings (RoPE) (Su et al., 2024) — multiply \(Q\) and \(K\) by a complex-rotation matrix that depends on token position; the dot product \(\langle Q_i, K_j \rangle\) then encodes the relative position \(i - j\). The dominant choice in 2026 diffusion DiTs.
- 2-D RoPE factorises the rotation across \((H, W)\) axes; NaViT-style variable-resolution RoPE (Dehghani et al., 2023) supports arbitrary aspect ratios. FiT / FiTv2 (Lu et al., 2024) and FLUX.1 are built on this.
Spatial, Temporal, and 3-D Attention in Video Diffusion¤
Naive 3-D attention over a video latent is prohibitively expensive — for \(T \times H \times W\) tokens, complexity is \(O((THW)^2)\). Three strategies dominate:
- Factorised space-time attention — interleave 2-D spatial attention blocks (treating \(T\) as a batch dimension) with 1-D temporal attention blocks (treating \(HW\) as a batch dimension). Used in Imagen Video, AnimateDiff, Stable Video Diffusion, and most early video-DiT recipes. \(O(THW \cdot (HW + T))\).
- Full 3-D attention with spatio-temporal patchification — Sora-style. Patch the video at \((t, h, w)\) resolution to keep \(N\) tractable, then run unified self-attention across all spatio-temporal tokens. Used by Sora, Wan 2.x, HunyuanVideo, Mochi-1.
- Sparse / structured 3-D attention — neighbourhood attention in \((t, h, w)\), axial attention, or learned sparse patterns — driving efficient long-context video generation in 2025–2026 (e.g. LiteAttention for video DiTs, arXiv:2511.11062; FrameDiT matrix attention, arXiv:2603.09721).
Windowed and Neighborhood Attention (Sub-Quadratic, Sparse)¤
For pixel-space and very high-resolution latents, full self-attention is replaced by local attention windows:
- Window Self-Attention (Liu et al., 2021 — Swin) — partition tokens into non-overlapping windows; attend within each window, then shift the windows between layers to mix information.
- Neighborhood Attention (NA / NAT) (Hassani et al., 2023 — CVPR) — every token attends to its \(k\) nearest neighbours in a sliding window. Linear in \(N\), translation-equivariant (unlike Swin's shifted windows), and the attention pattern of choice for Hourglass DiT at high resolution. The NATTEN extension provides hand-tuned CUDA kernels.
- Dilated / mixed windows combine multiple receptive-field sizes per layer.
Linear-Attention Variants (Beyond Softmax)¤
A small zoo of \(O(N)\) attention alternatives has crossed over from LLMs into diffusion (covered in detail in Beyond U-Net → State-Space and Linear-Attention Backbones):
- Linear / kernelised attention (Katharopoulos et al., 2020; Performer, Choromanski et al., 2021) — replace softmax with a feature map \(\phi\), yielding \(\text{Att} = \phi(Q)(\phi(K)^\top V)\) in \(O(N d^2)\).
- Gated Linear Attention — used in DiG (Zhu et al., 2025 — CVPR).
- Selective state-space (Mamba) layers — used in DiM, DiMSUM, U-Shape Mamba.
- Linear compressed attention for image-to-image, paired with softmax for image-text — used in EDiT / MM-EDiT (Suresh et al., 2025).
- Hybrid blocks with a fixed mix of linear and softmax attention layers — the dominant 2026 LLM trend, increasingly appearing in diffusion.
Token Merging, Pruning, and Sparse Attention¤
Rather than redesign the attention mechanism, several methods reduce the token count:
- ToMeSD (Bolya & Hoffman, 2023 — CVPRW) — merge similar tokens before each attention block in Stable Diffusion. Up to 60 % token reduction → 2× speedup, 5.6× memory reduction, no retraining, minimal quality loss.
- ToMA — Token Merge with Attention (2025) — adapts token merging specifically for diffusion transformers.
- DiffSparse (ICLR 2026) — differentiable layer-wise token sparsity with a learnable router; further accelerates DiTs without retraining.
- DyDiT / DyDiT++ (Zhao et al., 2024–25) — adaptive computation along both timestep and spatial dimensions; 51 % FLOPs reduction on DiT-XL.
Implementation: FlashAttention and Friends¤
Most production diffusion stacks compile attention through one of three backends, all of which are drop-in numerically-exact replacements for naive PyTorch / JAX softmax-attention:
- FlashAttention (Dao et al., 2022 — NeurIPS; FlashAttention-2, Dao 2023; FlashAttention-3, Shah et al., 2024) — IO-aware tiled attention that never materialises the \(N \times N\) score matrix in HBM. FA-2 hits ~70 % of an A100's theoretical FLOPS; FA-3 is tuned for Hopper's asynchronous Tensor Cores and TMA. Most modern DiTs would not be trainable at scale without it.
- xFormers memory-efficient attention (Meta) — earlier and still widely used in Stable Diffusion deployments.
- PyTorch SDPA / FlexAttention — vendor-blessed kernels with automatic dispatch to FA-3 / Triton on supported hardware.
Feature and KV Caching at Inference¤
Diffusion is iterative, so consecutive denoising steps usually have highly correlated attention activations. Modern fast-inference stacks cache them:
- DeepCache (Ma et al., 2024 — CVPR) — caches U-Net deep-feature activations across steps; ~2× speedup.
- Learning-to-Cache (L2C) (Ma et al., 2024 — NeurIPS) — learns which DiT layers are safe to cache between steps.
- TeaCache — feature caching baked into Lumina-Image 2.0, FLUX, and several open video stacks for ~2× inference speedup.
- ToCa, FORA, Δ-Cache — newer 2024–2025 methods that cache token-level attention output rather than whole layers.
These caching techniques compose with Flash Attention and with distillation, so a fully-optimised 2026 inference stack typically combines: distillation (1–4 NFEs) + flash-attention kernels + KV / feature caching + (optional) token merging + (optional) quantisation. The aggregate is the ~50–500× inference-time speedup over a 2022 raw-DDPM baseline that makes near-real-time T2I production viable today.
Choosing an Attention Pattern: a 2026 Cheat-Sheet¤
| Setting | Recommended attention pattern |
|---|---|
| Small-to-mid latent DiT (≤ 64²) | Full self-attention with FlashAttention-3, RoPE positions. |
| Large-resolution pixel-space (1024²+) | Neighborhood Attention (HDiT) or windowed attention. |
| Large-resolution latent (≥ 128²), efficiency-critical | Linear / SSM attention (DiG, DiM) or token-merging on top of full attention. |
| Multimodal text-to-image | Joint attention (MMDiT) on concatenated text+image tokens. |
| Video diffusion, short clips | Factorised space-time attention. |
| Video diffusion, long / high-res | Full 3-D attention with patchification + sparse / neighbourhood patterns. |
| Variable aspect ratios | 2-D RoPE + masked MHSA (FiT / FiTv2). |
| Production T2I inference | FA-3 + L2C / TeaCache + (optional) ToMeSD on top of a distilled student. |
The unifying picture: softmax self-attention plus RoPE positions has become the default inner loop of the modern diffusion backbone, while every other axis (windowing, factorisation, linearisation, token merging, caching) is a tool for managing its quadratic cost.
Model Parameterization Choices¤
The same denoising network can be trained to predict different targets derived from \(x_t\). Given the standard VP-SDE forward \(x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon\) (write \(a_t = \sqrt{\bar\alpha_t}\), \(b_t = \sqrt{1-\bar\alpha_t}\), so \(a_t^2 + b_t^2 = 1\)), the common prediction targets are:
- ε — the additive Gaussian noise (DDPM)
- x₀ — the clean image (data prediction)
- v — Salimans & Ho's "velocity" (a rotation of (ε, x₀))
- score — \(\nabla_{x_t} \log q(x_t)\) (score matching)
- F — the EDM "preconditioned" target (Karras et al. 2022)
- flow-matching velocity — \(u_t(x) = \mathbb{E}[x_1 - x_0 \mid x_t]\) (a different "velocity" from Salimans & Ho's v)
Crucially, given \(x_t\) and the schedule \((a_t, b_t)\) all of these are deterministic affine functions of each other — a network that learns one of them implicitly defines the others. So why do we ever care which target we regress against? Because the loss landscape, gradient scale, numerical conditioning, and effective per-timestep loss weighting all depend on the choice.
The Unifying Picture: A Linear Algebra of Targets¤
Pick any two of \((x_t, x_0, \epsilon)\) and the third is determined:
The score (in the VP-SDE) is just a rescaled negative noise:
A diagonal sanity check on \(v_t\):
| Target | Limit at high noise (\(a_t \to 0\), \(b_t \to 1\)) | Limit at low noise (\(a_t \to 1\), \(b_t \to 0\)) |
|---|---|---|
| \(\epsilon\) | \(\epsilon\) — unit Gaussian, well-scaled | small — signal-starved |
| \(x_0\) | \(x_0\) — data-scale (large), signal-starved | \(x_0\) — data-scale, well-scaled |
| \(v_t\) | \(\epsilon\) (since \(b_t \to 1\)) | \(-x_0\) (since \(a_t \to 1\)) |
| \(\nabla \log q\) | \(-\epsilon\) — finite | \(-\epsilon / 0\) — diverges |
The practical asymmetry is here: ε-prediction is well-scaled at high noise but provides almost no learning signal at low noise (because \(\epsilon \approx (x_t - x_0)/0\) there); \(x_0\)-prediction is well-scaled at low noise but supervisory at high noise becomes "guess the data from pure noise". \(v\)-prediction smoothly interpolates between the two as \(t\) varies, and the EDM \(F\)-target normalises the network output to unit variance at every \(t\).
graph TD
subgraph "Same network, six targets"
N["ε_θ(x_t, t)<br/>noise pred"] -.->|"affine"| X["x̂_0(x_t, t)<br/>data pred"]
X -.->|"affine"| V["v_θ(x_t, t)<br/>velocity"]
V -.->|"affine"| S["s_θ(x_t, t)<br/>score"]
S -.->|"preconditioning"| F["F_θ(x_t, σ)<br/>EDM target"]
F -.->|"flow-matching"| U["u_θ(x_t, t)<br/>FM velocity"]
end
style N fill:#e1f5ff
style X fill:#fff3e0
style V fill:#f3e5f5
style S fill:#e8f5e9
style F fill:#fff9c4
style U fill:#ffebee
ε-Prediction (Noise Prediction) — DDPM Default¤
Network predicts \(\epsilon_\theta(x_t, t) \approx \epsilon\):
- Pros: Numerically stable; the regression target is a unit-variance Gaussian across all timesteps, which conditions the optimisation well at moderate-to-high noise. Default for DDPM, Stable Diffusion 1.x, classifier guidance, and most open implementations.
- Cons: Vanishingly small training signal as \(b_t \to 0\) (the target becomes essentially random rounding noise). When converted back to \(x_0\) via \(\hat x_0 = (x_t - b_t\,\hat\epsilon)/a_t\), even a tiny error in \(\hat\epsilon\) at low noise blows up by \(1/a_t\) → blurry samples in the last denoising steps unless variance is learned (Nichol & Dhariwal, 2021).
x₀-Prediction (Data Prediction)¤
Network predicts \(\hat x_\theta(x_t, t) \approx x_0\):
- Pros: Cleanest target at low noise; the prediction is in the same space as the data, so dynamic-thresholding tricks (Saharia et al., 2022 — Imagen) and clipping to \([-1,1]\) are natural. The basis of consistency models, score-distillation sampling, and most distillation methods.
- Cons: At high noise the target is almost independent of \(x_t\), so the loss saturates and the network learns the data prior rather than how to denoise. Also produces out-of-range predictions before clipping.
v-Prediction (Velocity)¤
Salimans & Ho (2022) note that ε-pred and \(x_0\)-pred fail at opposite SNR extremes, and define the velocity target
So \(v_t\) is the tangent vector along the noise/data unit-circle. As \(t \to 0\), \(v_t \to -x_0\); as \(t \to T\), \(v_t \to \epsilon\). The L2 loss \(\|v - v_\theta\|^2\) therefore stays well-conditioned at every timestep, and gradient SNR varies much less across \(t\) than for ε or \(x_0\) alone.
- Pros: Best-in-class numerical stability, especially under cosine schedules where the SNR range is wide and asymmetric. Particularly important for progressive distillation (where teacher and student see very different noise levels) and video (where temporal extents push the schedule toward higher SNRs).
- Cons: Conceptually less transparent than ε / \(x_0\); requires care converting back for sampling.
- Production users: Imagen, Stable Diffusion 2.x at 768², SD3's rectified-flow target reduces to v-pred in a special case, progressive-distillation papers, and most modern video models (Sora-class).
The conversion back to ε / \(x_0\) from a v-prediction is just inverting the rotation:
Score Prediction (Score Matching)¤
The score is \(-\epsilon/b_t\) in the VP-SDE, so score-prediction is equivalent to ε-prediction up to a known scaling. Score-based models (Song & Ermon, 2019; Song et al., 2021) usually train directly in the score parameterisation because it ports cleanly to continuous-time SDE / probability-flow-ODE samplers. The price paid: the target diverges as \(b_t \to 0\), which forces noise-conditioning on the network and a careful sampler near \(t = 0\).
EDM "F-Prediction" with Preconditioning¤
Karras et al., 2022 — EDM reframe parameterisation choice as a preconditioning problem. In the variance-exploding (VE) parameterisation \(x_t = x_0 + \sigma \epsilon\), they wrap the network \(F_\theta\) with input/output rescalings:
with closed-form \(c_{\mathrm{skip}}, c_{\mathrm{out}}, c_{\mathrm{in}}, c_{\mathrm{noise}}\) chosen so that \(F_\theta\)'s input, target, and effective loss all have unit variance for every \(\sigma\). This is the most principled answer to "how should I parameterise?": ε, \(x_0\), and \(v\) are all special cases of EDM preconditioning at different \(\sigma\) regimes, and EDM smoothly interpolates them. Adopted (in spirit) by Stable Cascade, large-scale Karras-line models, and the EDM2 / ImageNet-512 SOTA pipelines.
Flow-Matching Velocity (a different "v")¤
To avoid notation collision: flow-matching velocity (Lipman et al., 2023) is a different object from Salimans & Ho's v-prediction. Flow matching defines a probability path \(p_t\) between data (\(p_0\)) and noise (\(p_1\)) via a conditional trajectory \(\psi_t(x_0, x_1)\), and trains a vector field
For the linear interpolant \(\psi_t = (1-t)x_0 + t\, x_1\) (rectified flow, Liu et al., 2023), the target is simply \(u_t = x_1 - x_0\). For the VP-Gaussian interpolant, the flow-matching velocity coincides with Salimans & Ho's v up to schedule-dependent rescaling. MMDiT (SD3) and FLUX.1 train the flow-matching velocity directly with a Logit-Normal timestep prior (Esser et al., 2024), which empirically beats v-prediction with a uniform prior on identical architectures.
How the Choice Interacts with Loss Weighting¤
The three classical losses are not the same loss with a relabelled target: they correspond to identical math only after the right per-timestep weighting. Hang et al. (2023) — Min-SNR make this explicit: for a unified Min-SNR-γ schedule, the loss weight depends on the parameterisation:
where \(\mathrm{SNR}(t) = a_t^2 / b_t^2\). With these weights the three are equivalent in expectation; with the naive unweighted MSE used in the simplified DDPM objective, ε-prediction implicitly upweights low-noise steps and \(x_0\)-prediction implicitly upweights high-noise steps.
Loss Re-weighting in 2024–2026: Logit-Normal, EDM2, and P2-Weighting¤
The implicit per-step weighting introduced by the parameterisation choice is now actively designed rather than accidental:
- EDM (Karras et al., 2022) samples \(\sigma\) from a log-normal distribution at training time, focusing capacity on the perceptually-relevant SNR band.
- EDM2 (Karras et al., 2024) makes this prior learnable and decouples loss weighting from the noise-sampling distribution.
- Logit-Normal sampling (Esser et al., 2024 — SD3) is the rectified-flow analogue: \(t \sim \mathrm{LogitNormal}(\mu, s^2)\) keeps mid-range timesteps oversampled.
- P2 weighting (Choi et al., 2022) prioritises the perceptually-rich phase of training and is widely used in 256² class-conditional ImageNet pipelines.
Practical Guidance: a 2026 Cheat-Sheet¤
| Setting | Recommended target | Why |
|---|---|---|
| Vanilla pixel-space DDPM, classical schedules | ε-prediction | Stable, simple; matches the Ho 2020 recipe and most pretrained checkpoints. |
| Latent diffusion at 512² (SD 1.x style) | ε-prediction + learned variance | The Nichol-Dhariwal hybrid objective; what SD-1.x ships. |
| Cosine / wide-SNR schedules, high-resolution, video | v-prediction | Numerically stable across the full SNR range; what SD 2.x@768, Imagen and most video models use. |
| Distillation (any direction → 1–4 steps) | v-prediction or x₀-prediction | v for progressive distillation; \(x_0\) for SDS / consistency / DMD-style methods. |
| Continuous-time SDE / score-based sampler | score | Cleanest plug-in to probability-flow ODEs and predictor-corrector samplers. |
| Karras-line ImageNet / FFHQ SOTA | EDM F-prediction with preconditioning | Unit-variance training target across all \(\sigma\); SOTA on every Karras paper since 2022. |
| Modern T2I / text-to-video at scale | flow-matching / rectified-flow velocity + Logit-Normal sampling | What SD3, FLUX.1, HunyuanVideo, Wan 2.x all use; superior prompt-following and text rendering empirically. |
Equivalence vs. preference
All six targets are mathematically equivalent through linear changes of variable. The empirical preference between them comes down to (a) which target is well-scaled in your training distribution's busiest SNR band, (b) how robust the sampler is to small errors in the chosen target near the SNR endpoints, and © which one composes most cleanly with your downstream task (distillation, guidance, ODE solving). Most production 2024–2026 models have moved to flow-matching velocity with Logit-Normal sampling; v-prediction remains the right default for U-Net-style models trained with cosine schedules.
Training Process¤
The Simplified Training Objective¤
The simplified loss ignores theoretical weightings from ELBO:
Training Algorithm:
- Sample training image \(x_0 \sim q(x_0)\)
- Sample timestep \(t \sim \text{Uniform}(1, T)\)
- Sample noise \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\)
- Compute noisy image \(x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon\)
- Predict noise \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\)
- Compute loss \(L = \|\epsilon - \hat{\epsilon}\|^2\)
- Update \(\theta\) via gradient descent
Remarkably simple: Just MSE between predicted and actual noise!
Loss Function Variants¤
Variational Lower Bound (VLB):
The full ELBO includes weighted terms for each timestep. While theoretically principled, optimizing full VLB is harder in practice.
Hybrid Objective (Nichol & Dhariwal, 2021):
Combines \(L_{\text{simple}}\) for mean prediction with VLB terms for variance learning:
Min-SNR-γ Weighting (Hang et al., 2023):
Clips weights at \(w_t = \min(\text{SNR}(t), \gamma)\) where \(\text{SNR}(t) = \bar{\alpha}_t / (1-\bar{\alpha}_t)\):
Typical \(\gamma = 5\). Achieves 3.4× faster convergence by preventing over-weighting easy steps.
Training Stability and Best Practices¤
Essential Training Practices
Exponential Moving Average (EMA): Critical for quality. Maintain running average:
with \(\beta = 0.9999\). Always use EMA weights for inference, not raw training weights.
Gradient Clipping: Prevents exploding gradients. Clip gradient norms to 1.0.
Mixed Precision Training: FP16/BF16 provides 2-3× speedup, 40-50% memory reduction.
Normalization:
- Group Normalization: Divide channels into groups (~32) for stability
- Layer Normalization: Alternative for transformer-based models
- No Batch Normalization: Batch statistics interfere with noise conditioning
Regularization:
- Weight Decay: \(10^{-4}\) to \(10^{-6}\) with AdamW optimizer
- Dropout: Sometimes used (rate 0.1-0.3) but less common than in other architectures
Hyperparameter Selection¤
Timesteps: \(T = 1000\) is standard for training. More steps provide finer granularity but slower sampling.
Noise Schedules:
- Cosine schedule outperforms linear empirically
- Critical: Ensure \(\bar{\alpha}_T \approx 0\) for pure noise at final step
Learning Rates:
- Standard: \(1 \times 10^{-4}\) to \(2 \times 10^{-4}\) with AdamW
- Sensitive domains (faces): \(1 \times 10^{-6}\) to \(2 \times 10^{-6}\)
- Use linear warmup over 500-1000 steps
Batch Sizes:
- Small images (32×32): 128-512
- Medium (256×256): 32-128
- Large (512×512): 8-32
- Use gradient accumulation to simulate larger batches
Optimizer Configuration:
Training Dynamics and Monitoring¤
Common Training Issues
Loss Plateaus: Normal behavior—loss doesn't directly correlate with quality. Monitor visual samples!
NaN Losses: Usually from exploding gradients. Enable gradient clipping and mixed precision loss scaling.
Poor Sample Quality: Check EMA is enabled, noise schedule is correct, sufficient training steps completed.
What to Monitor:
- Training Loss: Should decrease initially, then plateau
- Visual Samples: Generate every 5k-10k steps at fixed noise seeds
- FID Score: Compute on validation set every 25k-50k steps
- Gradient Norms: Should be stable, not exploding
- Learning Rate: Track warmup and decay schedules
Checkpoint Management:
- Save both regular and EMA weights
- Keep checkpoints every 50k-100k steps
- Save best checkpoint based on FID score
- Include optimizer state for resuming training
Computational Requirements¤
GPU Requirements (rough guidance for training a 256² latent-diffusion model from scratch):
- Minimum (small models, gradient accumulation, mixed precision): 16 GB VRAM (RTX 4060 Ti 16 GB / RTX 3090 / A4000).
- Recommended: 24 GB VRAM (RTX 3090, 4090, 5090, A5000).
- Large-scale: 40–80 GB per GPU (A100, H100, H200) across multi-node clusters.
- Inference-only of a published SD 1.5 / SDXL checkpoint runs comfortably on 8–12 GB consumer GPUs.
Training Times:
- Small datasets (10k images): Days on single GPU
- Medium (100k images): Weeks on multiple GPUs
- Large-scale (millions): Months on hundreds of GPUs
- ImageNet 256×256 on 8× A100: 7-14 days
Memory Optimizations:
- Gradient Checkpointing: 30-50% memory reduction, 20% slowdown
- Mixed Precision: 40-50% memory reduction, 2-3× speedup
- Smaller Batch Sizes: Use gradient accumulation to maintain effective batch size
Sampling Methods¤
DDPM Sampling: The Iterative Reverse Process¤
The foundational DDPM sampling starts from pure noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and iteratively denoises:
where \(z \sim \mathcal{N}(0, \mathbf{I})\) and \(\sigma_t\) controls stochasticity.
Algorithm:
- Sample \(x_T \sim \mathcal{N}(0, \mathbf{I})\)
- For \(t = T, T-1, \ldots, 1\):
- Predict noise: \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\)
- Compute mean: \(\mu_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon} \right)\)
- Sample: \(x_{t-1} = \mu_t + \sigma_t z\)
- Return \(x_0\)
Characteristics:
- Stochastic: Introduces randomness at each step.
- Slow: Requires \(T=1000\) neural network evaluations (seconds to minutes for a 256² image on a single GPU).
- High Quality: Excellent sample quality with sufficient steps.
- ~1000× slower than single-pass generators like GANs before applying any acceleration. Modern distilled samplers (DDIM-50 → 50× faster, DPM-Solver → 50–100×, LCM/SiD/DMD2 → 250–1000×) close most of this gap; see Recent Advances.
graph LR
A["x_T<br/>Pure Noise"] -->|"Denoise Step T"| B["x_{T-1}"]
B -->|"Denoise Step T-1"| C["x_{T-2}"]
C -->|"..."| D["x_t"]
D -->|"..."| E["x_1"]
E -->|"Final Denoise"| F["x_0<br/>Generated Image"]
style A fill:#ffccc9
style F fill:#c8e6c9
style D fill:#fff3e0
DDIM: Fast Deterministic Sampling¤
DDIM (Song, Meng & Ermon, 2021 — ICLR) constructs non-Markovian forward processes sharing DDPM's marginals but enabling much larger reverse steps:
When \(\sigma_t = 0\), sampling becomes fully deterministic.
Key Advantages:
- 10-50× speedup: Reduces from 1000 steps to 50-100 steps
- No retraining: Works with any pre-trained DDPM checkpoint
- Deterministic: When \(\eta = 0\), enables consistent reconstructions
- Interpolation: Meaningful latent space interpolation
Algorithm (Deterministic \(\sigma_t = 0\)):
- Sample \(x_T \sim \mathcal{N}(0, \mathbf{I})\)
- Choose subset of timesteps \(\{\tau_1, \tau_2, \ldots, \tau_S\}\) where \(S \ll T\)
- For \(i = S, S-1, \ldots, 1\):
- Predict \(x_0\): \(\hat{x}_0 = \frac{x_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}} \epsilon_\theta(x_{\tau_i}, \tau_i)}{\sqrt{\bar{\alpha}_{\tau_i}}}\)
- Compute \(x_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \hat{x}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}} \epsilon_\theta(x_{\tau_i}, \tau_i)\)
- Return \(x_0\)
DDIM in Practice
DDIM became the standard inference method for production systems. Stable Diffusion defaults to 50 DDIM steps for a good quality/speed trade-off. Fewer steps (20-25) work for quick previews.
Advanced ODE Solvers¤
DPM-Solver (Lu et al., 2022 — NeurIPS):
Treats diffusion sampling as solving ODEs with specialised numerical methods:
- Higher-order solver (order 2–3) with convergence guarantees.
- FID 4.70 in 10 steps, 2.87 in 20 steps on CIFAR-10.
- 4–16× speedup over previous samplers.
DPM-Solver++ (Lu et al., 2023):
Addresses instability with large classifier-free-guidance scales:
- Uses data prediction with dynamic thresholding.
- Performs well with 15–20 steps for guided sampling.
- Better numerical stability than DPM-Solver under large CFG scales.
PNDM — Pseudo Numerical Methods (Liu et al., 2022 — ICLR):
Treats DDPMs as solving differential equations on manifolds:
- Higher quality at 50 steps than 1000-step DDIM.
- 20× speedup with quality improvement.
UniPC (Zhao et al., 2023) and DEIS (Zhang & Chen, 2022) are common modern alternatives in production samplers.
Consistency Models: One-Step Generation¤
Consistency Models (Song et al., 2023 — ICML) and the follow-up Improved Techniques for Training Consistency Models (ICLR'24 Oral) propose a paradigm shift: learn a consistency function \(f\) that directly maps any point on a trajectory to its endpoint:
The self-consistency property:
Consistency Distillation:
Train by distilling from pre-trained diffusion model:
where \(\hat{x}_{t_n}^\phi\) is one step of ODE solver from \(x_{t_{n+1}}\) using teacher model.
Results:
- FID 3.55 on CIFAR-10 in one step
- Improved techniques: FID 2.51 in one step, 2.24 in two steps
- Consistency Trajectory Models (CTM): FID 1.73 in one step
- Zero-shot editing without task-specific training
Revolutionary Speed
Consistency models achieve 1000× speedup over DDPM while maintaining competitive quality. This makes diffusion viable for real-time applications.
Guidance Techniques¤
Classifier Guidance (Dhariwal & Nichol, 2021):
Modifies the score using gradients from a separately trained noise-aware classifier:
where \(w\) is guidance scale, \(c\) is class label.
Advantages: First method to push diffusion past GANs on ImageNet (FID 4.59 vs BigGAN-deep's 6.95). Disadvantages: Requires training noise-aware classifiers at all noise levels.
Classifier-Free Guidance (Ho & Salimans, 2022):
Eliminates the classifier by jointly training conditional and unconditional models:
During training, randomly drop condition \(c\) with probability ~10%.
Advantages:
- No auxiliary classifier needed
- Often better quality than classifier guidance
- Single guidance scale \(w\) controls trade-off
- Industry standard: Used by DALL·E 2, Stable Diffusion, Midjourney, Imagen
Common Guidance Scales:
- \(w = 1.0\): No guidance (unconditional)
- \(w = 3-5\): Moderate guidance, balanced quality/diversity
- \(w = 7-8\): Standard guidance (Stable Diffusion default: 7.5)
- \(w = 10-15\): Strong guidance, high fidelity but lower diversity
- \(w > 20\): Over-guided, saturated colors, artifacts
graph LR
A["Conditional<br/>ε(x_t, c)"] --> C["Guidance<br/>Interpolation"]
B["Unconditional<br/>ε(x_t, ∅)"] --> C
C --> D["Guided Noise<br/>Prediction"]
D --> E["Denoising<br/>Step"]
style A fill:#e1f5ff
style B fill:#ffccc9
style D fill:#c8e6c9
Diffusion Model Variants¤
Latent Diffusion Models (LDM / Stable Diffusion)¤
Latent Diffusion (Rombach et al., 2022 — CVPR) introduced the most important architectural pattern in modern diffusion: run diffusion in a VAE latent space, not in pixel space.
Two-Stage Approach:
- Train autoencoder compressing images 8× (512×512×3 → 64×64×4)
- Run diffusion in compressed latent space
Architecture:
graph TB
A["Input Image<br/>512×512×3"] --> B["VAE Encoder"]
B --> C["Latent Space<br/>64×64×4<br/>(8× compression)"]
C --> D["Diffusion U-Net<br/>(with cross-attention)"]
E["Text Prompt"] --> F["CLIP/T5<br/>Encoder"]
F --> D
D --> G["Denoised Latent<br/>64×64×4"]
G --> H["VAE Decoder"]
H --> I["Generated Image<br/>512×512×3"]
style C fill:#fff3e0
style D fill:#e1f5ff
style I fill:#c8e6c9
Key Benefits:
- 2.7× training/inference speedup
- 1.6× FID improvement
- Massively reduced memory: ~10GB VRAM for 512×512 generation
- Cross-attention conditioning: Text embeddings guide generation
Stable Diffusion Implementation:
- 860M-parameter U-Net in latent space
- Trained on LAION-5B (5 billion text-image pairs)
- Open-source release democratized text-to-image generation
- Versions: SD 1.4, 1.5, 2.0, 2.1 (U-Net latents), SDXL (2.6B U-Net), SD3 / SD3.5 (MMDiT, up to 8B), with SD-Turbo and FLUX.1 [schnell] adding 1–4-step distilled variants.
SDXL (Podell et al., 2023):
- 2.6B-parameter UNet (≈ 3× SD 1.5)
- Dual text encoders (OpenCLIP-G + CLIP-L) concatenated
- Native 1024×1024 training and generation
- Two-stage: base model + refiner
Stable Diffusion 3 / SD3.5 (Esser et al., 2024 — MMDiT):
- Rectified-flow Transformer (MMDiT) replacing U-Net
- Multimodal Diffusion Transformer with separate streams for text and image tokens
- Sizes: 800M, 2B, 8B parameters
- Substantially better text rendering and prompt following
FLUX.1 (Black Forest Labs, 2024):
- 12B-parameter rectified-flow transformer (the largest open-weights image model at release).
- Hybrid multimodal + parallel-DiT blocks with rotary positional embeddings.
- Three variants: FLUX.1 [pro] (closed, top quality), FLUX.1 [dev] (open, non-commercial), FLUX.1 [schnell] (Apache-2.0, distilled to 1–4 steps via latent adversarial diffusion distillation — see the Recent Advances section).
- Routinely ranks #1 / #2 on community prompt-following and visual-quality leaderboards in 2024–2025.
Impact
Latent diffusion made high-quality generation accessible. Stable Diffusion has 100M+ users and massive ecosystem of fine-tunes, LoRAs, and community tools.
Conditional Diffusion Models¤
Class-Conditional Generation:
Add class label \(y\) as conditioning:
Typically implemented via:
- Class embeddings concatenated with time embeddings
- Conditional batch normalization: Modulate batch norm with class info
- Cross-attention: Attend to class token
Text-to-Image Models:
DALL·E 2 (Ramesh et al., 2022 — unCLIP):
- Two-stage: CLIP prior diffusion + decoder.
- CLIP prior maps text embeddings to image embeddings.
- Decoder is a 64×64 diffusion model + two cascaded super-resolution diffusion models (→ 256² → 1024²).
- Successors: DALL·E 3 (Betker et al., 2023) — improves prompt following via better caption supervision.
Imagen (Saharia et al., 2022):
- Frozen T5-XXL (4.6B parameters) as text encoder.
- Cascaded diffusion: 64×64 → 256×256 → 1024×1024.
- Key finding: scaling the text encoder improves quality more than scaling the U-Net.
- FID 7.27 on COCO (state-of-the-art at time).
- Imagen 3 (Google, 2024) is the latest iteration, with improved photorealism and text-rendering.
Midjourney (2022–2025):
- Proprietary diffusion / DiT-style model.
- Versions V1 → V7 by mid-2025.
- ~20 M users; one of the largest revenue-generating generative-AI services.
Cascade Diffusion Models¤
Generate through multiple resolution stages, each a separate diffusion model:
- Base model: Generate at 64×64
- Super-resolution 1: Upscale to 256×256
- Super-resolution 2: Upscale to 1024×1024
Advantages:
- Each stage focuses on different detail scales
- More efficient than single high-resolution model
- Better quality through specialized models
Disadvantages:
- Complexity of training multiple models
- Error accumulation across stages
Used in DALL·E 2 and Imagen.
Video Diffusion Models¤
Video diffusion extends spatial generation to the temporal dimension. Two architectural lineages dominate:
- 3-D U-Net (Imagen Video, AnimateDiff, Stable Video Diffusion): inflate a 2-D U-Net's spatial convolutions to spatio-temporal, with factorised space-time attention — see Spatial, Temporal, and 3-D Attention.
- Spacetime DiT (Sora, HunyuanVideo, Wan, Mochi-1, CogVideoX, LTX-Video): patchify the video at \((t, h, w)\) resolution and run a transformer over the unified spatio-temporal token sequence.
The transformer-based stack has become the preferred direction for many new large-scale video models since Sora (2024), while U-Net-style systems remain important in earlier and lighter deployments.
Sora (Brooks et al., OpenAI, 2024):
- Diffusion Transformer on spacetime patches — videos compressed into a lower-dimensional latent, then patchified like a ViT.
- Up to 1 minute of 1080p video, variable aspect ratios.
- Qualitative evidence of physical consistency and object permanence at scale, alongside documented limitations.
Sora 2 (OpenAI, 2025):
- Synchronized audio generation.
- Improved physics simulation, multi-shot continuity.
- Instruction-following for long, complex scenes.
Veo 3 (Google DeepMind, 2025) and Movie Gen (Polyak et al., Meta, 2024) are the leading closed-weights competitors.
Open-source video models (2024–2025):
- HunyuanVideo (Tencent, 2024) — 13B-parameter open foundation video model; outperformed Luma 1.6 and several closed Chinese models on human-eval text alignment, motion quality, and visual quality.
- CogVideoX (Yang et al., 2024) — 2B and 5B parameter expert-MoE diffusion models with 3-D causal-conv VAE.
- Mochi-1 (Genmo, 2024) — 10B asymmetric DiT (AsymmDiT) with an 8×8 spatial / 6× temporal VAE; 480p, 5.4 s, Apache-2.0.
- Wan 2.1 / 2.2 (Alibaba, 2025) — 2.2 introduces a Mixture-of-Experts diffusion backbone that splits denoising across timesteps into specialised experts; current open-weights SOTA on motion quality.
- LTX-Video (Lightricks, 2025) — real-time video generation distilled to few steps.
- Stable Video Diffusion (Blattmann et al., 2023) — image-to-video extension of SD.
3D Diffusion Models¤
DreamFusion (Poole et al., 2022):
Uses 2D text-to-image models as priors for 3D generation via Score Distillation Sampling (SDS):
- Initialize random 3D NeRF
- Render 2D views from random camera angles
- Apply noise and use Imagen to denoise
- Backpropagate through rendering to update NeRF
- Repeat
Key Innovation: No 3D training data required—leverages 2D diffusion models.
Capabilities:
- Text-to-3D generation
- Viewable from any angle
- Relightable
- Exportable as meshes
Stable-DreamFusion:
- Uses Stable Diffusion instead of Imagen
- Open-source implementation
- Enables text-to-3D and image-to-3D
Applications: Game assets, VR/AR content, product design, 3D scene reconstruction
Comparison with Other Generative Models¤
Diffusion vs. GANs¤
| Aspect | Diffusion Models | GANs |
|---|---|---|
| Sample Quality | SOTA (FID 1.2–2.5 on standard benchmarks) | Competitive after R3GAN / GigaGAN; classical GANs trail by ~1–2 FID points |
| Training Stability | Very stable, MSE-style objectives | Notoriously unstable; needs spectral norm + R1 + careful tuning |
| Mode Coverage | Excellent, likelihood-based | Prone to mode collapse |
| Inference Speed | 20–1000 steps un-distilled; 1–4 steps distilled | 1 forward pass |
| Training Ease | Forgiving hyperparameters | Requires careful balancing |
| Controllability | Excellent (CFG, guidance, editing, ControlNet) | Strong via latent editing; weaker for complex prompts |
| Latent Space | Implicit in the noise trajectory; no encoder | Implicit; trained encoder optional (ALI, BiGAN) |
Speed Comparison (rough orders-of-magnitude on a single consumer GPU at 256–512² resolution):
- GANs: ~0.01–0.05 s per image (single pass).
- DDPM: 10–60 s per image (1000 steps).
- DDIM (50 steps): 1–5 s per image.
- DPM-Solver / UniPC (10–20 steps): 0.3–1 s per image.
- LCM / SDXL Turbo / DMD2 / FLUX.1 [schnell] (1–4 steps): 0.05–0.3 s per image — competitive with raw GAN inference in practice.
When to Use Each
Choose Diffusion: - Quality/diversity paramount - Training stability important - Denoising, super-resolution, inpainting tasks - Computational resources available
Choose GANs: - Real-time generation required - Single-pass critical (e.g., video style transfer) - Limited inference compute - Interactive applications
Diffusion vs. VAEs¤
| Aspect | Diffusion Models | VAEs |
|---|---|---|
| Sample Quality | Sharp, high-fidelity | Blurrier under L2 / Gaussian-decoder; sharp once paired with adversarial / VQ losses |
| Latent Space | No explicit low-dim latent | Explicit, interpretable, supports interpolation and arithmetic |
| Likelihood | Tractable via the probability-flow ODE | Tractable lower bound (ELBO) |
| Training | Stable | Very stable |
| Inference Speed | Multi-step (1–1000 NFEs) | Single forward pass |
| Role in modern stacks | The generator in latent-space pipelines | The codec feeding diffusion (SD-VAE, video VAEs) |
Hybrid Approach: Latent Diffusion¤
These two models are not competitors in 2026 — they are complementary stages. Stable Diffusion, SDXL, SD3, FLUX.1, and many leading open video models use an autoencoder / VAE-like codec to compress pixels before running diffusion or flow matching on the compressed latent. Sora's public technical report likewise describes a compressed latent representation decomposed into spacetime patches, but it does not expose enough implementation detail to call the codec a specific VAE variant. The full pattern is described in Latent Diffusion Models. The VAE gives a 2.7× speedup over pixel-space diffusion at matching quality and shrinks memory by an order of magnitude — see also the VAE explainer's tokenizer-VAE section.
Diffusion vs. Autoregressive Models¤
| Aspect | Diffusion | Autoregressive |
|---|---|---|
| Generation order | Many tokens in parallel per step; arbitrary positions revisited | Strictly left-to-right (or fixed scan order) |
| Per-image / per-sequence cost | \(K\) NFEs × full forward pass (parallel over tokens) | \(N\) tokens × full forward pass (sequential) |
| Quality (images / video / audio) | Often strongest for high-fidelity visual / audio synthesis | Strong (VAR, MAR); previously trailed diffusion |
| Quality (text) | Catching up in selected settings — LLaDA, Mercury, Dream, and ReFusion show competitive 8B-scale results | Still the dominant production choice |
| Controllability | Excellent — edit / inpaint at any position, classifier-free guidance | Limited — conditioning must precede generated tokens |
| Variable length | Fixed-length outputs (or padded) | Natural variable-length |
Complementary strengths. Diffusion remains the default for images, audio, and video; autoregression remains the default for text but diffusion language models (covered in Recent Advances → Diffusion Language Models) are emerging as a serious alternative — and the VAR / MAR family (Tian et al., 2024; Li et al., 2024) shows AR can be competitive for images too. The clean dichotomy of the early 2020s has dissolved in both directions.
Hybrid Models:
HART — Hybrid Autoregressive Transformer (Tang et al., 2024):
- Autoregressive for coarse structure
- Diffusion for fine details
- 9× faster than pure diffusion at matching quality
- 31% less training compute
Theoretical Frameworks¤
Beyond the basic DDPM / score-based / flow-matching formulations, two design-space frameworks have shaped modern training recipes. Flow matching and rectified flow are covered in Recent Advances → Flow Matching at Scale; EDM, the unifying parameterisation framework for continuous-time diffusion, is covered next.
EDM: Elucidating the Design Space¤
The EDM line of work — Karras et al., 2022 (NeurIPS) — Elucidating the Design Space and Karras et al., 2024 (CVPR Oral) — Analyzing and Improving Training Dynamics — provides a unified framework separating the otherwise-tangled design choices (preconditioning, schedule, sampler, EMA):
Optimal Preconditioning:
Normalizing inputs/outputs of network for better training:
Karras Noise Schedule:
Widely adopted in practice, superior to linear/cosine.
Results: FID 1.79 on CIFAR-10 (state-of-the-art)
EDM2 (2024):
- Analyzed training dynamics at scale
- Redesigned architecture with magnitude preservation
- Post-hoc EMA: Set parameters after training without retraining
- FID 1.81 on ImageNet-512 (previous record: 2.41)
For the broader 2024–2026 picture — distillation, diffusion language models, post-training, and open-source video — see the dedicated Recent Advances section that follows. For backbone architectures see Beyond U-Net above.
Recent Advances (2024–2026): Distillation, Flow Matching, and Diffusion Language Models¤
If 2020–2023 was about establishing diffusion as the dominant generative-model class, 2024–2026 is about closing the inference-cost gap, scaling DiTs, expanding into language and video, and post-training for human preferences. This section surveys the key threads, grounded in recent (last-six-months) surveys: Yang et al. — Diffusion Models: A Comprehensive Survey of Methods and Applications (continuously updated, ACM CSUR 2024), Efficient Diffusion Models (2024), Efficient Diffusion Models: A Survey (2025), and the Survey of Video Diffusion Models (April 2025).
Few-Step Distillation: From 50 Steps to 1¤
Producing a high-quality image with a stock diffusion model still required 20–50 NFEs in 2023. The 2023–2025 wave of distillation methods has cut this to 1–4 NFEs at near-teacher quality:
| Method | Year | Idea | Steps | Notable result |
|---|---|---|---|---|
| Progressive Distillation (Salimans & Ho, 2022) | 2022 | Iteratively halve the teacher's step count. | 4 | FID 3.0 on CIFAR-10. |
| Consistency Models (Song et al., 2023) | 2023 | Self-consistency map: every point on a trajectory decodes to the endpoint. | 1–2 | FID 3.55 → 2.51 → 1.73 (CTM, Kim et al., 2024). |
| Latent Consistency Models / LCM-LoRA (Luo et al., 2023; Luo et al., 2023) | 2023 | Distill any latent-diffusion model in ~32 A100-hours; LCM-LoRA is a plug-in LoRA accelerator that drops into any SD fine-tune. | 2–4 | 768² in 2–4 steps. |
| Adversarial Diffusion Distillation (ADD) / SDXL Turbo (Sauer et al., 2023) | 2023 | Add a GAN-style discriminator (frozen DINOv2 backbone) on top of score-distillation. | 1–4 | Single-step SDXL Turbo at SDXL-class quality. |
| Score Identity Distillation (SiD / SiDA) (Zhou et al., ICML 2024) | 2024 | Three score-related identities yield exponentially fast FID decay; SiDA adds adversarial real-data signal. | 1 | FID 1.499 (CIFAR-10), 1.110 (ImageNet 64²). |
| DMD / DMD2 (Yin et al., 2024 — NeurIPS Oral) | 2024 | Distribution-Matching Distillation: KL between student and teacher's noise-conditional distributions, plus a GAN loss against real data (DMD2). | 1 | FID 1.28 (ImageNet 64²), 8.35 (zero-shot COCO 2014); SOTA one-step. |
| Diffusion2GAN (Kang et al., 2024) | 2024 | Distil into a conditional GAN with E-LatentLPIPS. | 1 | Megapixel one-step. |
| APT — Adversarial Post-Training (Lin et al., 2025) | 2025 | Single-step distillation extended to video, with an approximated R1. | 1 | First Sora-class 1-step video. |
| Few-step SiD (Zhou et al., 2025) | 2025 | Multi-step generalisation of SiD. | 2–4 | Tighter FID at higher step count. |
The throughline: adversarial losses are now a major branch of diffusion distillation, blurring the GAN-vs-diffusion distinction while coexisting with non-adversarial consistency / LCM-style methods (see also GANs Explained — Adversarial Distillation).
Flow Matching and Rectified Flow at Scale¤
Continuous-time flow models have become common for new large-scale visual generators, especially when paired with transformer backbones.
Flow Matching (Lipman et al., 2023 — ICLR) trains a vector field \(v_\theta(x, t)\) to regress against a target conditional probability path (Gaussian, optimal-transport, etc.):
- Simulation-free training — no ODE solving during training.
- More stable dynamics than discrete-time DDPM.
- Compatible with arbitrary noise–data couplings, generalised by Stochastic Interpolants (Albergo & Vanden-Eijnden, 2023; Albergo et al., 2024) which bridge ODEs and SDEs in a single framework.
Rectified Flow (Liu, Gong & Liu, 2023 — ICLR) is a special case that learns ODEs following straight paths between noise and data:
where the conditional trajectories are straight lines \(\psi_t = (1-t)x_0 + t\, x_1\). The reflow operation iteratively straightens trajectories:
- Sample pairs \((x_0, x_1)\) from data and noise.
- Train \(v_\theta\) on the (curved) interpolant.
- Use \(v_\theta\) to generate new deterministic pairs.
- Retrain on these straightened pairs.
- Repeat.
After one or two reflow rounds, InstaFlow (Liu et al., 2023) achieves high quality in 1–2 steps (~0.12 s/image).
Production scale: rectified-flow + Logit-Normal sampling powers MMDiT in Stable Diffusion 3 / 3.5 (Esser et al., 2024), FLUX.1 (Black Forest Labs, 2024), and many modern open video DiTs (HunyuanVideo, Wan 2.2, Mochi-1). It is a common training objective for new large-scale image and video generators in 2025–2026.
Representation Alignment: Faster DiT Training (2024)¤
REPA (Yu et al., 2024 — ICLR'25 Oral) regularises a small number of intermediate DiT activations to match the features of a frozen self-supervised encoder (DINOv2). One simple feature-alignment loss → 17.5× faster SiT-XL training, and FID 1.42 with classifier-free guidance — the best-in-class for ImageNet 256² class-conditional. Follow-ups include U-REPA for U-Net architectures and the REPA-stop study (2025) showing the alignment is most useful early in training.
Diffusion Language Models (2024–2026)¤
The biggest new modality for diffusion is discrete text. Discrete diffusion (Austin et al., 2021 — D3PM) was a niche idea until 2024; in 2025 it became a serious competitor to autoregressive LLMs:
- LLaDA (Nie et al., 2025) — a masked-diffusion LLM trained from scratch at 8B parameters, competitive with LLaMA3-8B on in-context learning and instruction following after SFT.
- Mercury (Inception Labs, 2025) — a commercial-scale diffusion LLM with multi-token parallel decoding, reporting substantially lower latency than autoregressive baselines in selected settings.
- ReFusion (2025) — interleaves diffusion-based selection with autoregressive infilling, enabling KV-cache reuse; 18× speedup on average.
- Dream / SEDD and the broader masked-diffusion-LM family report competitive scaling on reasoning and language benchmarks at similar parameter counts, but the broad production comparison with AR LLMs is still early.
The structural advantage: diffusion LLMs predict many tokens in parallel, so they decouple sequence length from latency — the dominant 2024–2026 motivation for moving past autoregression. See VILA-Lab's Awesome-DLMs for a continuously-updated survey.
Open-Source Video Diffusion (2024–2025)¤
The video-generation landscape went from "Sora is unique" to a thriving open-source ecosystem in 18 months. Leading open models such as HunyuanVideo, Wan 2.2, Mochi-1, CogVideoX, and LTX-Video follow a related recipe: a 3-D video autoencoder / VAE-like codec compresses the video, then a rectified-flow / flow-matching DiT (or MMDiT) is trained on the latents with cross-attention or joint attention to a text encoder.
graph TD
A["Pixel video<br/>T×H×W×3"] --> B["3-D causal<br/>VAE encoder"]
B --> C["Spatio-temporal<br/>latent z"]
C --> D["DiT / MMDiT<br/>(text + frame patches)"]
D --> E["Denoised z'"]
E --> F["VAE decoder"]
F --> G["Generated video"]
style B fill:#f3e5f5
style D fill:#e1f5ff
style F fill:#fff3e0
The full per-model list with citations and benchmark numbers is in Diffusion Model Variants → Video Diffusion Models. The VAE side of this stack (CV-VAE, WF-VAE, IV-VAE, OD-VAE) is covered in the VAE explainer's tokenizer section.
Post-Training: RLHF, DPO, Reward Models for Diffusion¤
As with LLMs, the 2024–2026 frontier is post-training, not pre-training:
- DPO for diffusion (Wallace et al., 2024 — Diffusion-DPO) — direct preference optimisation in \(x_0\)-space, no separate reward model needed.
- DDPO (Black et al., 2024) — REINFORCE-style RL with reward models (CLIP, aesthetic, OCR).
- Reward-guided distillation (Kim et al., 2024) — fine-tune the distilled student against a reward model.
- RAFT, Aligning text-to-image, Reward-Imagenet — a thread of large-scale reward-model-based fine-tunes that drove most of the 2024 SD/SDXL/Flux quality jump.
A Brief Note on Anti-Diffusion: 2026 Critiques¤
A small but growing 2025–2026 thread questions the centrality of diffusion:
- MAR (Li et al., 2024) shows masked-AR generation can match diffusion quality.
- VAR (Tian et al., 2024) — visual autoregression with next-scale prediction; first AR model to beat DiT on ImageNet.
- LDM-without-VAE (2025–2026) — argues the VAE bottleneck hurts diffusion training; explores VAE-free LDMs.
These are not replacements for diffusion in the broad sense, but they suggest the algorithmic frontier is unifying (DiT, masked AR, flow matching, consistency, adversarial) rather than picking a single winner.
Production Considerations¤
Deployment Challenges¤
Model Size (full-precision FP32; FP16/BF16 halves these numbers):
- DDPM (256×256): ~200–500 MB.
- Stable Diffusion 1.5: 860 M-param U-Net + ~85 M-param VAE ≈ 4 GB total.
- SDXL: 2.6 B-param U-Net (≈ 10 GB).
- SD3 / SD3.5: 800 M to 8 B parameters (MMDiT).
- FLUX.1: 12 B parameters (~24 GB), distilled [schnell] variant runs much lighter at inference.
Optimization Strategies:
- Model Pruning: Remove 30-50% weights with <5% quality loss
- Quantization: INT8/FP16 reduces size 2-4× with minimal quality loss
- Knowledge Distillation: Train smaller student model
Inference Optimization:
- ONNX Runtime: 10-30% speedup
- TensorRT: 2-5× speedup on NVIDIA GPUs
- JAX compilation: Use
jax.jitandjax.lax.scanto compile repeated denoising loops - Flash Attention: 2-3× speedup for attention layers
Hardware Requirements:
- Inference: Minimum RTX 3060 (12GB) for 512×512
- Recommended: RTX 4090 (24GB) or professional GPUs
- Edge Deployment: Optimized models run on mobile (SD Turbo, LCM)
Monitoring and Quality Control¤
Quality Drift:
Monitor generated samples over time for:
- Artifacts or distortions
- Color shifts
- Mode collapse
- Prompt adherence degradation
Metrics:
- FID: Track on validation set every N samples
- CLIP Score: For text-to-image, measure alignment
- Human Evaluation: A/B tests for subjective quality
- Diversity Metrics: Ensure mode coverage
A/B Testing:
Compare model versions using:
- FID/IS on held-out data
- Human preference studies (typically 1000+ comparisons)
- Production metrics (engagement, retention, quality reports)
Ethical Considerations¤
Responsible Deployment
Deepfakes and Misinformation:
Diffusion models enable photorealistic fake images/videos. Mitigation strategies:
- Watermarking generated content
- Provenance tracking (C2PA metadata)
- Detection models for synthetic content
- Usage policies and terms of service
Bias and Fairness:
Models inherit biases from training data (LAION-5B, etc.):
- Underrepresentation of minorities
- Stereotypical associations
- Geographic/cultural biases
Mitigation:
- Balanced training data curation
- Bias evaluation across demographics
- Red-teaming for harmful generations
Copyright and Attribution:
Training on copyrighted images raises questions:
- Fair use vs. infringement debates ongoing
- Artist consent and compensation
- Attribution for training data
Best practices:
- Respect opt-out requests (Have I Been Trained)
- Consider ethical training data sources
- Transparent documentation of training data
Environmental Impact:
Large-scale training requires massive compute:
- ImageNet training: 1000s of GPU-hours
- Stable Diffusion: ~150,000 A100-hours
- SDXL: ~500,000 A100-hours
Mitigation:
- Efficient architectures (Latent Diffusion)
- Distillation for deployment
- Carbon-aware training scheduling
- Renewable energy for data centers
Safety Filters and Content Moderation¤
Safety Classifiers:
Pre-deployment filters to prevent harmful content:
- NSFW Detection: Classify unsafe content
- Violence Detection: Flag graphic violence
- Hate Symbol Detection: Block extremist imagery
Prompt Filtering:
- Block harmful prompt patterns
- Detect adversarial prompts
- Rate limiting for abuse prevention
Post-Generation Filtering:
- Run safety classifier on outputs
- Block unsafe images before showing user
- Log violations for monitoring
Summary and Key Takeaways¤
Diffusion models have revolutionized generative AI through an elegant approach: learning to reverse a gradual noising process. By systematically destroying data structure through fixed forward diffusion and learning the reverse process through neural networks, these models achieve state-of-the-art quality with remarkable training stability.
Core Principles:
- Forward diffusion gradually corrupts data into pure noise over \(T\) timesteps
- Reverse diffusion learns to progressively denoise, reconstructing data from noise
- Training objective reduces to simple MSE between predicted and actual noise
- Sampling iteratively applies learned denoising, refining noise into data
Key Variants:
- DDPM: Foundational stochastic sampling with 1000 steps.
- DDIM: Deterministic fast sampling reducing to 50–100 steps.
- Latent Diffusion (Stable Diffusion family): operates in a VAE latent space — the dominant production recipe.
- Consistency Models / LCM / SiD / DMD2: one-step or 1–4-step generation achieving 100–1000× speedups via distillation.
- Rectified Flow / Flow-Matching DiTs: straight-path ODEs powering SD3, FLUX.1, HunyuanVideo, Wan 2.2.
- Diffusion Transformers (DiT, U-ViT, MMDiT, SiT): transformer backbones that scale better than U-Nets.
- Autoregressive visual challengers (MAR, VAR): transformer image generators that borrow GPT-style scaling ideas rather than diffusion denoising.
- Diffusion Language Models (LLaDA, Mercury, Dream, ReFusion): masked discrete diffusion competitive with autoregressive baselines in selected 8B-scale settings.
Architecture Innovations:
- U-Net with skip connections — the classical backbone (DDPM, SD 1.x / 2.x, Imagen, ControlNet).
- Diffusion Transformers (DiT, U-ViT, MMDiT, PixArt, HDiT, SiT, FiT) — the dominant 2024–2026 backbone for new large-scale models; covered in Beyond U-Net.
- adaLN / adaLN-Zero / adaLN-single conditioning — the standard timestep-injection mechanism for transformer backbones.
- Joint attention (MMDiT) — concatenated text+image streams; the key innovation behind SD3 and FLUX.1's text rendering.
- Sparse / linear-attention backbones — DiT-MoE (16B sparse), DiM, DiG, EDiT, HDiT-style neighbourhood attention for high-resolution efficiency.
- 2-D RoPE positions — variable-resolution / variable-aspect-ratio support (FiT, FLUX.1).
- FlashAttention-⅔ + feature caching (DeepCache, L2C, TeaCache) — the implementation foundation that makes everything above trainable and deployable.
Training Best Practices:
- Use cosine noise schedule for better dynamics
- Apply EMA with decay 0.9999—critical for quality
- Gradient clipping and mixed precision for stability
- Monitor visual samples not just loss curves
- Min-SNR weighting for 3.4× faster convergence
Sampling Methods:
- DDPM: 1000 steps, highest quality at the slowest speed.
- DDIM: 50–100 steps, deterministic, 10–20× faster.
- DPM-Solver / DPM-Solver++ / UniPC / DEIS: 10–20 steps with high-order ODE solvers.
- Consistency Models / LCM / SiD / DMD2 / ADD: 1–4 distilled steps, near real-time at near-teacher quality.
Guidance Techniques:
- Classifier-free guidance as industry standard
- Guidance scale \(w=7-8\) balances quality and diversity
- Higher \(w\) increases fidelity, reduces diversity
When to Use Diffusion Models:
- Quality and diversity are paramount
- Denoising, super-resolution, inpainting, editing tasks
- Text-to-image, text-to-video generation
- Scientific applications (protein design, drug discovery)
- Training stability more important than inference speed
Current Landscape (2025–2026):
- Diffusion dominates high-quality image, video, and 3D generation.
- Distillation has closed the speed gap — 1–4-step samplers (LCM, SDXL Turbo, DMD2, FLUX.1 [schnell]) approach teacher quality at near-GAN inference cost.
- Rectified-flow DiTs (SD3, FLUX.1, HunyuanVideo, Wan 2.2) are increasingly preferred over U-Net/DDPM recipes for new large-scale visual models.
- Diffusion language models (LLaDA, Mercury) are emerging as a serious AR-LLM alternative, with substantially lower decoding latency reported in selected settings.
- Open-source video (HunyuanVideo, Wan, Mochi-1, CogVideoX, LTX-Video) is rapidly closing the gap with Sora / Veo / Movie Gen.
- Post-training (DPO for diffusion, RLHF, reward models) drives most quality improvements at the application layer.
Future Directions:
- One-step video distillation at production scale (APT, 2025).
- Diffusion LLMs scaling past 70B parameters with multi-token-per-step decoding.
- Unified multimodal foundation models that share a single denoising backbone across image, video, audio, and text.
- Edge / mobile deployment via aggressive distillation + quantisation (SD Turbo, LCM-LoRA, Mercury).
- Scientific applications: protein design (RFdiffusion), molecular docking (DiffDock), materials science, medical imaging.
- Better theoretical understanding of why DiTs scale, why REPA accelerates training, and why adversarial distillation works so well.
Diffusion models in 2026 are no longer a single algorithmic family — they are a shared mathematical substrate (forward corruption + learned reverse) that absorbs ideas from GANs (adversarial post-training), VAEs (latent codec front-ends), autoregressive models (masked discrete diffusion), and normalising flows (rectified flow, flow matching). As architectures scale and sampling becomes more efficient, they will likely remain dominant for visual and multimodal generation while expanding into new modalities and scientific applications.
Next Steps¤
-
Practical usage guide with implementation examples and training workflows
-
Complete API documentation for DDPM, DDIM, Latent Diffusion, and variants
-
Step-by-step hands-on tutorial: train a diffusion model on MNIST from scratch
-
Explore Stable Diffusion, video diffusion, and state-of-the-art architectures
Further Reading¤
Seminal Papers (Must Read)¤
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics"
arXiv:1503.03585 | ICML 2015
The foundational paper introducing diffusion probabilistic models
Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models"
arXiv:2006.11239 | NeurIPS 2020
DDPM: Made diffusion models practical with simplified training objective
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations"
arXiv:2011.13456 | ICLR 2021 Outstanding Paper
Unified framework via SDEs, probability flow ODEs
Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models"
arXiv:2010.02502 | ICLR 2021
DDIM: Fast deterministic sampling without retraining
Nichol, A., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models"
arXiv:2102.09672 | ICML 2021
Learned variances, cosine schedule, hybrid objective
Dhariwal, P., & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis"
arXiv:2105.05233 | NeurIPS 2021
Showed diffusion superiority, introduced classifier guidance
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models"
arXiv:2112.10752 | CVPR 2022
Latent diffusion / Stable Diffusion: 2.7× speedup in VAE latent space
Tutorial Papers and Surveys¤
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., ... & Cui, B. (2023). Survey paper on diffusion methods and applications
arXiv:2209.00796
Complete 150+ page survey covering theory and applications
Cao, H., Tan, C., Gao, Z., Chen, G., Heng, P. A., & Li, S. Z. (2023). "A Survey on Generative Diffusion Models"
arXiv:2209.02646
Covers mathematical foundations, applications, and future directions
Luo, C. (2022). "Understanding Diffusion Models: A Unified Perspective"
arXiv:2208.11970
Excellent tutorial connecting ELBO, score matching, and SDEs
Important Variants and Extensions¤
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents"
arXiv:2204.06125
DALL·E 2: CLIP prior diffusion for text-to-image
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., ... & Norouzi, M. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding"
arXiv:2205.11487 | NeurIPS 2022
Imagen: T5 text encoder with cascaded diffusion
Ho, J., & Salimans, T. (2022). "Classifier-Free Diffusion Guidance"
arXiv:2207.12598
Industry-standard guidance without auxiliary classifiers
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., ... & Rombach, R. (2023). "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis"
arXiv:2307.01952
SDXL: 2.3B parameter upgrade to Stable Diffusion
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis"
arXiv:2403.03206
Stable Diffusion 3: Multimodal diffusion transformer
Sampling and Acceleration¤
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps"
arXiv:2206.00927 | NeurIPS 2022
10-20 step high-quality sampling via ODE solvers
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2023). "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models"
arXiv:2211.01095
Improved stability for classifier-free guidance
Liu, L., Ren, Y., Lin, Z., & Zhao, Z. (2022). "Pseudo Numerical Methods for Diffusion Models on Manifolds"
arXiv:2202.09778 | ICLR 2022
PNDM: 20× speedup with quality improvement
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Consistency Models"
arXiv:2303.01469 | ICML 2023
One-step generation via consistency distillation
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Improved Techniques for Training Consistency Models"
arXiv:2310.14189 | ICLR 2024 Oral
FID 2.51 in one step, 2.24 in two steps
Salimans, T., & Ho, J. (2022). "Progressive Distillation for Fast Sampling of Diffusion Models"
arXiv:2202.00512 | ICLR 2022
Iteratively halve sampling steps through distillation
Architecture Innovations¤
Peebles, W., & Xie, S. (2023). "Scalable Diffusion Models with Transformers"
arXiv:2212.09748 | ICCV 2023
DiT: Replaces U-Net with transformer, scales to billions of parameters
Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., & Zhu, J. (2023). "All are Worth Words: A ViT Backbone for Diffusion Models"
arXiv:2209.12152 | CVPR 2023
U-ViT: Combines ViT with U-Net skip connections
Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models"
arXiv:2206.00364 | NeurIPS 2022
EDM: Unified framework, optimal preconditioning, FID 1.79
Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., & Laine, S. (2024). "Analyzing and Improving the Training Dynamics of Diffusion Models"
arXiv:2312.02696 | CVPR 2024 Oral
EDM2: FID 1.81 on ImageNet-512, post-hoc EMA
Crowson, K., et al. (2024). "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers" (HDiT, ICML)
arXiv:2401.11605
Hierarchical pixel-space DiT with neighborhood attention; linear in pixel count, SOTA on FFHQ-1024².
Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., & Xie, S. (2024). "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" (ECCV)
arXiv:2401.08740
DiT backbone trained with stochastic interpolants; FID 2.06 / 2.62 on ImageNet 256² / 512².
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., ... & Li, Z. (2023). "PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis"
arXiv:2310.00426
0.6B-parameter DiT with cross-attention + adaLN-single; matches SD 2.0 quality at ~12% of the compute.
Chen, J., et al. (2024). "PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation"
arXiv:2403.04692
Direct 4K generation via weak-to-strong curriculum and KV-compression attention.
Tencent (2024). "Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding"
arXiv:2405.08748
1.5B-parameter MMDiT-style backbone for English/Chinese T2I.
Lu, Z., et al. (2024). "FiT: Flexible Vision Transformer for Diffusion Model" (ICML Spotlight)
arXiv:2402.12376
Variable-resolution / arbitrary-aspect-ratio DiT via 2-D RoPE + masked MHSA.
Lu, Z., et al. (2024). "FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model"
arXiv:2410.13925
2× faster convergence; introduces adaLN-LoRA, Q-K norm, rectified-flow schedule.
Fei, Z., et al. (2024). "Scaling Diffusion Transformers to 16 Billion Parameters" (DiT-MoE)
arXiv:2407.11633
Sparse Mixture-of-Experts DiT; 16.5B params, 3.1B active, FID 1.80 on ImageNet 512².
Park, B., et al. (2024). "Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts" (ECCV)
openreview Switch-DiT
Per-timestep expert routing for DiT.
Apple (2024). "EC-DiT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing"
machinelearning.apple.com/research/ec-dit
Expert-choice routing for diffusion MoE.
Teng, Y., et al. (2024). "DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis"
arXiv:2405.14224
Mamba SSM blocks replace attention; linear scaling at high resolution.
Phung, H., et al. (2024). "DiMSUM: Diffusion Mamba — A Scalable and Unified Spatial-Frequency Method for Image Generation" (NeurIPS)
openreview DiMSUM
Wavelet + Mamba diffusion backbone.
Zhu, L., et al. (2025). "DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention" (CVPR)
arXiv:2405.18428
Gated linear attention for diffusion; 1.8× faster than Flash-DiT at 2048².
Suresh, S., et al. (2025). "EDiT: Efficient Diffusion Transformers with Linear Compressed Attention"
arXiv:2503.16726
MM-EDiT: linear-time image-to-image attention paired with softmax text-image attention.
Zhao, Y., et al. (2024–2025). "DyDiT / DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics"
arXiv:2504.06803
Adaptive computation along timestep + spatial axes; 51% FLOPs reduction on DiT-XL.
Ma, X., et al. (2024). "Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching" (NeurIPS)
proceedings/Paper-Conference.pdf
Static layer caching across denoising steps.
(2026). "DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity" (ICLR 2026)
arXiv:2604.03674
Differentiable layer-wise token sparsity for DiT acceleration.
Chen, C., et al. (2025). "DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text-to-Image Generation"
arXiv:2503.10618
66% size reduction on MMDiT via layer-wise parameter sharing.
Alpha-VLLM (2024). "Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT"
arXiv:2406.18583
Unified multimodal MMDiT-style backbone (image, video, audio, 3D).
(2025). "Lumina-Image 2.0: A Unified and Efficient Image Generative Framework"
arXiv:2503.21758
Production iteration of Lumina-Next with TeaCache acceleration.
(ICLR Blogposts 2026). "From U-Nets to DiTs: The Architectural Evolution of Text-to-Image Diffusion Models (2021–2025)"
iclr-blogposts.github.io
Comprehensive timeline of the U-Net → DiT → MMDiT progression.
Wei, Y., et al. (2024). "Unveiling the Secret of AdaLN-Zero in Diffusion Transformer"
openreview AdaLN-Zero
Theoretical analysis of why zero-init adaLN converges so much faster.
Tian, K., et al. (2024). "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (VAR, NeurIPS Best Paper)
arXiv:2404.02905
Next-scale autoregression beats DiT on ImageNet 256² (FID 1.73 vs 2.27); 20× faster inference.
Li, T., et al. (2024). "Autoregressive Image Generation without Vector Quantization" (MAR)
arXiv:2406.11838
Continuous-token AR with a per-token diffusion head; matches diffusion quality.
Attention Patterns and Position Encodings¤
Vaswani, A., et al. (2017). "Attention Is All You Need" (NeurIPS)
arXiv:1706.03762
The original transformer / multi-head self-attention; foundational for every diffusion DiT.
Su, J., et al. (2024). "RoFormer: Enhanced Transformer with Rotary Position Embedding" (RoPE)
arXiv:2104.09864
Rotary position embeddings — the dominant choice in 2026 DiTs; 2-D variants underpin FiT and FLUX.
Dehghani, M., et al. (2023). "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" (NeurIPS)
arXiv:2307.06304
Variable-resolution / variable-aspect-ratio ViT; the basis of multi-aspect-ratio diffusion DiTs.
Liu, Z., et al. (2021). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (ICCV Best Paper)
arXiv:2103.14030
Shifted-window self-attention for vision; the conceptual predecessor of NA / HDiT's local attention.
Hassani, A., et al. (2023). "Neighborhood Attention Transformer" (NAT, CVPR)
arXiv:2204.07143
Linear-complexity sliding-window attention; the attention pattern HDiT uses at high resolution.
Katharopoulos, A., et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML)
arXiv:2006.16236
Foundational \(O(N)\) kernelised attention.
Choromanski, K., et al. (2021). "Rethinking Attention with Performers" (ICLR)
arXiv:2009.14794
Random-feature linear attention with FAVOR+; broad ancestor of modern linear-attention diffusion backbones.
Attention Implementation and Inference Acceleration¤
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS)
arXiv:2205.14135
IO-aware tiled exact attention; the basis of essentially every modern DiT training stack.
Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning"
arXiv:2307.08691
2× faster than FA-1; ~70 % of A100 theoretical FLOPS.
Shah, J., et al. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision"
arXiv:2407.08608
Hopper-optimised; uses async Tensor Cores and TMA for FP8 / BF16 long-context attention.
Bolya, D., & Hoffman, J. (2023). "Token Merging for Fast Stable Diffusion" (CVPRW)
arXiv:2303.17604
ToMeSD — 2× speedup, 5.6× memory reduction on Stable Diffusion via training-free token merging.
(2025). "ToMA: Token Merge with Attention for Diffusion Models"
arXiv:2509.10918
Token merging adapted specifically for diffusion transformers.
Ma, X., Fang, G., & Wang, X. (2024). "DeepCache: Accelerating Diffusion Models for Free" (CVPR)
arXiv:2312.00858
Training-free U-Net feature caching across denoising steps; ~2× speedup.
(2025). "LiteAttention: A Temporal Sparse Attention for Diffusion Transformers"
arXiv:2511.11062
Sparse temporal attention for video DiTs.
(2026). "FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation"
arXiv:2603.09721
Frame-level matrix attention for high-quality, efficient video diffusion.
Flow Matching and Optimal Transport¤
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). "Flow Matching for Generative Modeling"
arXiv:2210.02747 | ICLR 2023
Simulation-free training of continuous normalizing flows
Liu, X., Gong, C., & Liu, Q. (2023). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
arXiv:2209.03003 | ICLR 2023
Straight-line ODE paths, reflow operation
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis"
arXiv:2403.03206
Stable Diffusion 3 technical report
Albergo, M. S., & Vanden-Eijnden, E. (2023). "Building Normalizing Flows with Stochastic Interpolants"
arXiv:2303.08797
Stochastic interpolants framework — generalises flow matching with stochastic couplings.
Black Forest Labs (2024). "FLUX.1: Open-weights Rectified-flow Transformer for Text-to-Image"
github.com/black-forest-labs/flux
12 B-parameter rectified-flow transformer; FLUX.1 [schnell] is a 1–4-step open-weights model.
Distillation and Few-Step Generation (2023–2026)¤
Luo, S., Tan, Y., Huang, L., Li, J., & Zhao, H. (2023). "Latent Consistency Models: Synthesizing High-Resolution Images with Few-step Inference"
arXiv:2310.04378
2–4-step distillation of any latent diffusion model; ~32 A100-hours.
Luo, S., et al. (2023). "LCM-LoRA: A Universal Stable-Diffusion Acceleration Module"
arXiv:2311.05556
LoRA-based plug-in accelerator that drops into any SD fine-tune without retraining.
Sauer, A., Lorenz, D., Blattmann, A., & Rombach, R. (2023). "Adversarial Diffusion Distillation" (ADD / SDXL Turbo)
arXiv:2311.17042
1–4-step distillation of SDXL with a frozen DINOv2 discriminator; the recipe behind SDXL Turbo.
Zhou, M., Zheng, H., Wang, Z., Yin, M., & Huang, H. (2024). "Score Identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation" (ICML)
arXiv:2404.04057
SiD/SiDA — three score identities yield exponential FID decay.
Yin, T., et al. (2024). "Improved Distribution Matching Distillation for Fast Image Synthesis" (DMD2, NeurIPS Oral)
arXiv:2405.14867
Current SOTA one-step image generation: FID 1.28 (ImageNet 64²).
Kim, D., et al. (2024). "Consistency Trajectory Models" (CTM)
arXiv:2310.02279
FID 1.73 in one step on CIFAR-10.
Kang, M., et al. (2024). "Distilling Diffusion Models into Conditional GANs" (Diffusion2GAN)
arXiv:2405.05967
1-step megapixel synthesis via E-LatentLPIPS + GAN distillation.
Lin, S., et al. (2025). "Diffusion Adversarial Post-Training for One-Step Video Generation" (APT)
arXiv:2501.08316
Single-step adversarial distillation for video diffusion.
Training Acceleration and Representation Learning¤
Hang, T., et al. (2023). "Efficient Diffusion Training via Min-SNR Weighting Strategy"
arXiv:2303.09556
3.4× training speedup via SNR-clipped loss weighting; per-parameterisation weight formulae.
Choi, J., et al. (2022). "Perception Prioritized Training of Diffusion Models" (P2-weighting, CVPR)
arXiv:2204.00227
SNR-based loss weighting that prioritises perceptually-rich timesteps; widely used on ImageNet 256².
Yu, S., et al. (2024). "Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think" (REPA, ICLR'25 Oral)
arXiv:2410.06940
17.5× faster SiT-XL training via DINOv2 feature alignment; FID 1.42 on ImageNet 256².
Time / Noise-Level Conditioning¤
Tancik, M., et al. (2020). "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" (NeurIPS)
arXiv:2006.10739
Random Fourier features for low-dim scalar inputs; the basis of NCSN++ / EDM time embeddings.
Kingma, D. P., et al. (2021). "Variational Diffusion Models" (VDM, NeurIPS)
arXiv:2107.00630
Argues for log-SNR as the natural time coordinate; learnable noise schedule.
Sun, Q., et al. (2025). "Is Noise Conditioning Necessary for Denoising Generative Models?" (ICML)
arXiv:2502.13129
Surprising empirical result: removing time conditioning altogether produces only a small quality gap.
(NeurIPS 2025). "Is Noise Conditioning Necessary? A Unified Theory of Unconditional Graph Diffusion Models"
arXiv:2505.22935
Same finding holds in graph diffusion, with 4–6 % parameter and 8–10 % compute savings.
Diffusion Language Models (2024–2026)¤
Austin, J., et al. (2021). "Structured Denoising Diffusion Models in Discrete State-Spaces" (D3PM)
arXiv:2107.03006
Foundational discrete diffusion framework that LLaDA / Mercury build on.
Nie, S., et al. (2025). "Large Language Diffusion Models" (LLaDA)
arXiv:2502.09992
8B-parameter masked-diffusion LLM; competitive with LLaMA3-8B on instruction following.
Inception Labs (2025). "Mercury: Ultra-Fast Language Models Based on Diffusion"
arXiv:2506.17298
First commercial-scale diffusion LLM; 92 ms latency vs ~450 ms for AR baselines.
(2025–26). "ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding"
arXiv:2512.13586
Interleaves diffusion-based selection with AR infilling; 18× speedup with KV-cache reuse.
VILA-Lab (2025). "Awesome-DLMs: A Survey on Diffusion Language Models" (continuously updated)
github.com/VILA-Lab/Awesome-DLMs
Tracking the diffusion-LLM literature.
Post-Training and Alignment¤
Wallace, B., et al. (2024). "Diffusion Model Alignment Using Direct Preference Optimization" (Diffusion-DPO, CVPR 2024)
arXiv:2311.12908
DPO for diffusion — no separate reward model needed.
Black, K., et al. (2024). "Training Diffusion Models with Reinforcement Learning" (DDPO)
arXiv:2305.13301
REINFORCE-style RL for diffusion against CLIP / aesthetic / OCR rewards.
Recent Surveys (last 6 months)¤
(2024–2025). "Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices"
arXiv:2410.11795
Algorithm-, system-, and framework-level taxonomy of efficiency techniques.
(2025). "Efficient Diffusion Models: A Survey"
arXiv:2502.06805
Sampling acceleration, model compression, distillation.
(2025). "Survey of Video Diffusion Models: Foundations, Implementations, and Applications"
arXiv:2504.16081
Comprehensive review of diffusion-based video generation.
Video and 3D Generation¤
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., ... & Salimans, T. (2022). "Imagen Video: High Definition Video Generation with Diffusion Models"
arXiv:2210.02303
Cascaded video diffusion with temporal attention
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., ... & Ramesh, A. (2024). "Video Generation Models as World Simulators"
OpenAI Technical Report
Sora: Diffusion transformer on spacetime patches
Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). "DreamFusion: Text-to-3D using 2D Diffusion"
arXiv:2209.14988
Score Distillation Sampling for 3D generation
Lin, C.-H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., ... & Fidler, S. (2023). "Magic3D: High-Resolution Text-to-3D Content Creation"
arXiv:2211.10440 | CVPR 2023
Two-stage coarse-to-fine 3D generation
Scientific Applications¤
Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., ... & Baker, D. (2023). "De novo design of protein structure and function with RFdiffusion"
Nature 620, 1089–1100
Protein design via diffusion achieving experimental validation
Corso, G., Stärk, H., Jing, B., Barzilay, R., & Jaakkola, T. (2022). "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
arXiv:2210.01776 | ICLR 2023
Molecular docking as generative modeling
Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022). "Equivariant Diffusion for Molecule Generation in 3D"
arXiv:2203.17003 | ICML 2022
E(3)-equivariant diffusion for drug molecule design
Image Editing and Control¤
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). "RePaint: Inpainting using Denoising Diffusion Probabilistic Models"
arXiv:2201.09865 | CVPR 2022
Mask-agnostic inpainting with pretrained models
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., & Ermon, S. (2022). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations"
arXiv:2108.01073 | ICLR 2022
Edit images via noise addition and denoising
Brooks, T., Holynski, A., & Efros, A. A. (2023). "InstructPix2Pix: Learning to Follow Image Editing Instructions"
arXiv:2211.09800 | CVPR 2023
Edit images from natural language instructions
Zhang, L., Rao, A., & Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models"
arXiv:2302.05543 | ICCV 2023
ControlNet: Spatial conditioning with edges, depth, pose
Online Resources and Code¤
Lilian Weng's Blog: "What are Diffusion Models?"
lilianweng.github.io/posts/2021-07-11-diffusion-models
Complete blog post with excellent visualizations and intuitions
Yang Song's Blog: "Generative Modeling by Estimating Gradients of the Data Distribution"
yang-song.net/blog/2021/score
Deep dive into score-based models and SDEs
Hugging Face Diffusers Library
github.com/huggingface/diffusers
Production-ready implementations: DDPM, DDIM, Stable Diffusion, ControlNet
Stability AI: Stable Diffusion Official Repository
github.com/Stability-AI/generative-models
Official implementation of Stable Diffusion models
CompVis: Latent Diffusion Models
github.com/CompVis/latent-diffusion
Original latent diffusion implementation
Denoising Diffusion PyTorch
github.com/lucidrains/denoising-diffusion-pytorch
Clean, well-documented PyTorch implementations
Books and Complete Tutorials¤
Prince, S. J. D. (2023). "Understanding Deep Learning"
Chapter on Diffusion Models | udlbook.github.io/udlbook
Excellent pedagogical treatment with visualizations
Murphy, K. P. (2023). "Probabilistic Machine Learning: Advanced Topics"
Chapter on Score-Based and Diffusion Models | MIT Press
Rigorous mathematical treatment
Hugging Face Diffusion Models Course
huggingface.co/learn/diffusion-course
Hands-on tutorials from basics to advanced topics
Ready to build with diffusion models? Start with the Diffusion User Guide for practical implementations, check the API Reference for complete documentation, or dive into the MNIST Tutorial to train your first diffusion model from scratch!