Generative Models: A Unified View
- One Map for Six Families: a single navigable view of VAEs, GANs, Diffusion, Flows, EBMs, and Autoregressive models, covering how they relate, where they overlap, and how to choose between them.
- Modern Unifications: score-based diffusion as time-indexed EBMs, Gaussian-path diffusion and flow matching as shared probability-path algorithms, ARMs ↔ EBMs, Latent Diffusion = codec + generative prior, and more, with the assumptions and caveats that make each equivalence useful.
- A Thinker's Guide: beyond a cheat sheet, with conceptual lenses, scaling laws, metric failures, mechanistic interpretability, the Platonic Representation Hypothesis and its counterpoints, and a curated list of open theoretical problems.
- Frontier Research Directions: concrete novel-research seeds that fall out of the cross-family view, where a graduate student or research engineer could plausibly contribute in 2026.
Why This Document Exists
The six "Explained" pages in this directory each go deep on one family of generative models:
- VAE — encode-decode with a probabilistic bottleneck
- GAN — generator vs discriminator
- Diffusion — learn to denoise
- Flow — invertible transformations + flow matching
- EBM — learn an energy landscape
- Autoregressive — chain-rule factorisation
Each is excellent on its own, but a new practitioner — and an experienced one wanting to spot novel research directions — needs more than six independent deep-dives. They need a single map showing how the families relate, what they share, what they don't, and how the 2024–2026 wave of unifications has redrawn the territory. They also need a tour of the intellectual frontier where these unifications are forcing fundamental rethinks of evaluation, interpretability, and the very meaning of "generative".
That is the job of this document. It is the recommended first page to read in this directory; after this overview, the six deep-dives can be visited in any order. Treat the later sections (the Platonic Hypothesis, Scaling Laws Across Families, the Evaluation Crisis, Open Theoretical Problems, and Frontier Research Directions) as a thinker's reading list rather than a reference; they are designed to seed new ideas.
How to Read This Document
This page is structured as three concentric circles of detail. Pick the circle that matches your current goal:
graph TD
R1["<b>Circle 1 — Orientation</b><br/>Cheat Sheet → Three Lenses → Family Tree<br/><i>(skim if pressed for time)</i>"]
R2["<b>Circle 2 — Foundations</b><br/>Common Math → Training Objectives →<br/>Sampling → Modern Unifications →<br/>Probabilistic Graphical View → Conditional Generation"]
R3["<b>Circle 3 — Frontier & Practice</b><br/>Platonic Rep → Scaling Laws →<br/>Evaluation Crisis → Mech Interp →<br/>Reasoning → World Models →<br/>Failure Modes → Decision Guide →<br/>Open Problems → Frontier Research"]
R1 --> R2
R2 --> R3
style R1 fill:#fff9c4
style R2 fill:#e1f5ff
style R3 fill:#f3e5f5
Recommended order through the directory (after this page):
- Autoregressive Models — the simplest factorisation; cleanest example of "tractable likelihood by construction".
- VAEs — your first taste of latent variables and the ELBO. You'll need the ELBO again for diffusion.
- GANs — the canonical implicit density-ratio estimator. Once you understand GANs, adversarial distillation in modern diffusion makes sense.
- Energy-Based Models — the mathematical bedrock. Once you understand energy / score / log-density, diffusion and flow matching become "just an EBM with extra structure".
- Diffusion Models — the largest deep-dive; central to image / video / audio generation. Builds on score matching (EBM) and the ELBO (VAE).
- Normalizing Flows — from classical change-of-variables to modern flow matching; Gaussian-path flow matching and diffusion share a practical algorithmic core.
If you only have time for two deep-dives, read Diffusion and Autoregressive. The other pages are not obsolete or reducible footnotes, but many modern production stacks can be understood as hybrids around those two endpoints: iterative denoising / transport on one side, and chain-rule token generation on the other.
Prerequisites: working comfort with probability, calculus, and at least one of (PyTorch / JAX / TensorFlow). You don't need to know any specific generative-model family before starting — but the family-specific deep-dives assume you've at least skimmed Circle 1 of this page.
The Six Families: A One-Page Cheat Sheet
| Family | Core idea | Trains via | Generates by | Likelihood | Primary modern role (2026) |
|---|---|---|---|---|---|
| VAE | Encode to a probabilistic bottleneck, decode back | ELBO (reconstruction + KL) | Sample latent → decoder forward pass | Lower bound (ELBO) | Codec for latent diffusion / flow and video patch latents |
| GAN | Two-network minimax: generator vs discriminator | Adversarial (saddle-point) | Generator forward pass | None (implicit) | Real-time generation; adversarial-distillation post-training of diffusion models |
| Diffusion | Learn to reverse a fixed noising process | MSE on noise / score / velocity | 20–1000 ODE / SDE steps (or 1–4 distilled) | ELBO / ODE likelihood estimates | Image, video, 3-D generation; latent diffusion remains a dominant visual stack |
| Flow | Transform a base distribution via tractable bijections (classical) or a vector field (flow matching) | Maximum likelihood + Jacobian (classical) or vector-field regression (FM) | One forward pass (classical) or 1–50 ODE steps (FM) | Exact for classical flows; ODE divergence integral for CNF / FM | Rectified-flow / flow-matching transformers power SD3, FLUX.1, HunyuanVideo |
| EBM | Score every point with a scalar energy; low energy = high probability | Contrastive divergence, score matching, or Energy Matching | MCMC / Langevin / gradient descent on the landscape | Unnormalised; \(Z\) usually intractable | Density estimation, OOD detection; Equilibrium Matching (FID 1.90 on ImageNet 256² in 2025) |
| Autoregressive | Chain-rule factorisation: predict each element given its predecessors | Cross-entropy on next-token prediction | Sequential decoding (or speculative / multi-token) | Exact via the chain rule | LLMs (GPT-4, Llama-3, DeepSeek-V3); VAR for images |
A first reading suggests these are six unrelated approaches. The 2022–2026 literature has shown that they are not — see Modern Unifications below.
Three Lenses for Looking at Generative Models
A useful exercise before going deeper: every generative model can be classified along three orthogonal axes. Once you learn to think in these axes, the historical "VAE vs GAN vs diffusion" debates lose most of their force.
Lens 1: What Divergence Are We Minimising?
Every generative model is implicitly trying to minimise a divergence between \(p_\theta\) and \(p_{\mathrm{data}}\). The choice of divergence is the deepest design decision and it predicts most of a family's empirical behaviour:
| Family | Divergence minimised (formal or implicit) | Behavioural consequence |
|---|---|---|
| AR (MLE) | Forward KL: \(D_{\mathrm{KL}}(p_{\mathrm{data}}\,\|\,p_\theta)\) | Mode covering — assigns non-zero mass everywhere \(p_{\mathrm{data}}\) has mass |
| VAE (tight ELBO) | Forward KL on data plus the inference gap \(D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x))\) | Mode covering, but posterior collapse can leave the latent unused and reduce latent-driven diversity |
| GAN (vanilla — saturating) | Jensen-Shannon | Symmetric; vanishing-gradient pathology when supports separate |
| GAN (non-saturating) | Same optimal discriminator fixed point as vanilla GAN, but a generator loss chosen to avoid saturation | Better gradients early in training; mode collapse still comes from adversarial dynamics, capacity limits, support mismatch, and critic overfitting |
| WGAN / WGAN-GP | Wasserstein-1 (earth mover) | Smooth gradients even when supports don't overlap |
| Diffusion (DDPM, DSM) | Fisher divergence between scores (across all noise levels) | Smooth, mode-covering; the "regression-not-classification" feel |
| Flow Matching | Conditional vector-field MSE | Shares an algorithmic core with Gaussian-path diffusion; broader paths / couplings are not identical to DDPM |
| EBM (CD) | Forward KL minimised by SGD; CD's MCMC truncation introduces gradient bias | Mode covering when MCMC mixes; pseudo-mode collapse when it doesn't |
| EBM (Energy Matching) | OT cost + entropic regulariser | Both straight-path transport and equilibrium structure |
| f-GAN | Any f-divergence (KL, reverse-KL, Pearson, …) | Custom-tunable mode behaviour |
Why this matters: if you ever wonder why GANs mode-collapse and diffusion usually does not, the answer starts in this column but does not end there. The original minimax GAN has a JS-divergence support-mismatch problem; the non-saturating loss fixes much of the vanishing-gradient issue without changing the ideal fixed point; and practical mode collapse also depends on discriminator capacity, update balance, regularisation, and finite-batch dynamics. WGAN, R1 / R2 regularisation, relativistic losses, and modern adversarial-distillation methods all improve the gradient geometry rather than magically changing GANs into likelihood models.
The forward-vs-reverse-KL story is one of the deepest symmetries in the field. Forward KL (\(p_d \| p_\theta\)) is zero-avoiding — assigning low probability to a mode that exists in \(p_d\) is heavily penalised, so MLE-style models (AR, VAE, classical flows) cover all modes but may also assign mass to regions that aren't in \(p_d\) (sometimes called "spreading"). Reverse KL (\(p_\theta \| p_d\)) is zero-forcing — assigning probability to a region without data is heavily penalised, so reverse-KL-like objectives and reward-shaped policies are sharper but may collapse to a subset of modes.
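The zero-avoiding versus zero-forcing contrast can be checked numerically. The sketch below is a toy illustration, not taken from any of the cited papers: it fits a single Gaussian to a bimodal target on a 1-D grid by brute-force search, once under forward KL and once under reverse KL. The mixture parameters, grid, and search ranges are all assumptions chosen for readability.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Evenly spaced grid on [-8, 8]; every density is discretised onto it.
GRID = [-8 + 0.05 * i for i in range(321)]

# Hypothetical bimodal "data" distribution: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
P_DATA = normalise(
    [0.5 * normal_pdf(x, -3, 0.5) + 0.5 * normal_pdf(x, 3, 0.5) for x in GRID]
)

def kl(p, q, floor=1e-300):
    """Discrete KL(p || q); q is floored so that underflow does not divide by zero."""
    return sum(pi * math.log(pi / max(qi, floor)) for pi, qi in zip(p, q) if pi > 0)

def best_gaussian(direction):
    """Grid-search the single Gaussian minimising forward or reverse KL to P_DATA."""
    best = None
    for mean10 in range(-40, 41, 5):        # candidate means -4.0 .. 4.0
        for std10 in range(3, 41):          # candidate stds   0.3 .. 4.0
            mean, std = mean10 / 10, std10 / 10
            q = normalise([normal_pdf(x, mean, std) for x in GRID])
            score = kl(P_DATA, q) if direction == "forward" else kl(q, P_DATA)
            if best is None or score < best[0]:
                best = (score, mean, std)
    return best

FORWARD_FIT = best_gaussian("forward")   # broad Gaussian centred between the modes
REVERSE_FIT = best_gaussian("reverse")   # narrow Gaussian sitting on one mode

print("forward KL fit (kl, mean, std):", FORWARD_FIT)
print("reverse KL fit (kl, mean, std):", REVERSE_FIT)
```

The forward-KL fit spreads mass over the empty region between the modes in order to cover both, while the reverse-KL fit drops one mode entirely, which mirrors the "spreading" and "collapse" behaviours described above.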
Lens 2: What Forward Corruption Process Are We Inverting?
Almost every generative model can be written as "invert a forward corruption". The choice of corruption is the family:
| Family | Forward corruption \(q(z\mid x)\) or \(q(x_t\mid x_0)\) | What gets inverted |
|---|---|---|
| VAE | Encoder \(q_\phi(z \mid x)\) — learned Gaussian | Decoder generates \(x\) from \(z\) |
| Diffusion | Fixed Gaussian noising chain \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) | Reverse diffusion |
| Flow Matching | Fixed path \(\psi_t = (1-t)\,x_0 + t\,x_1\) (rectified; here \(x_0\) is noise and \(x_1\) is data, the reverse of the diffusion indexing) or any other interpolant | The vector field \(u_t\) |
| EBM (DRL) | Same fixed chain as DDPM, applied to a sequence of EBMs | Per-scale recovery |
| AR (masked-diffusion LM) | Mask tokens at rate \(t\) | Predict masked positions |
| AR (next-token) | "Erase tokens to the right of position \(i\)" | Predict the next token |
| VAR | Down-sample tokens to a coarser scale | Predict the next-finer scale |
| GAN | None — there is no forward corruption | Direct distributional matching |
This lens explains a recent surprise: next-token AR can be read as diffusion with a degenerate corruption process that erases positions strictly from the right, so that generation reveals one position, then two, then three. Under that framing, Wang et al., 2026 (ARMs are Secretly EBMs) and the LLaDA / Mercury / Dream diffusion-LM line are less surprising: they sit on a continuum between two corruption processes at opposite ends of a spectrum (one position at a time versus all positions independently noised).
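As a concrete instance of a fixed forward corruption, the Gaussian noising row of the table can be sketched in a few lines. The linear schedule values below (1000 steps, \(\beta\) from \(10^{-4}\) to \(0.02\)) are the commonly used DDPM defaults and are an assumption here, not something this page specifies; the function names are hypothetical.

```python
import math
import random

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def corrupt(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps; return (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal * xi + noise * ei for xi, ei in zip(x0, eps)]
    return x_t, eps

ALPHA_BARS = linear_alpha_bar()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
x_early, _ = corrupt(x0, ALPHA_BARS[0], rng)    # almost identical to x0
x_late, _ = corrupt(x0, ALPHA_BARS[-1], rng)    # almost pure Gaussian noise

print("alpha_bar at first / last step:", ALPHA_BARS[0], ALPHA_BARS[-1])
print("early:", x_early)
print("late: ", x_late)
```

Because the corruption is a closed-form Gaussian, any noise level can be sampled directly from \(x_0\) without simulating the chain, which is what makes the regression-style training objectives cheap.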
Lens 3: Where Is the Computational Bottleneck?
Each family has one dominant quantity that is hard to compute, and much of the family's design is about avoiding it:
| Family | Bottleneck | How avoided |
|---|---|---|
| AR | Sequential decode (length \(N\)) | Speculative decoding, MTP, parallel decoding |
| VAE | True posterior \(p_\theta(z\mid x)\) (and hence the marginal likelihood) for general decoders | Amortised inference net \(q_\phi(z\mid x)\) + ELBO |
| GAN | Density of \(p_g\) (implicit, not even queryable) | Avoid it entirely; train against a critic |
| Diffusion | True score \(\nabla_x \log p_t(x)\) of the data distribution at every noise level | Denoising score matching replaces it with a tractable regression target |
| Flow (classical) | \(\det J\) (general \(O(D^3)\)) | Triangular / structured Jacobians |
| Flow Matching | The unknown probability path | Conditional flow matching: pick a tractable conditional and marginalise |
| EBM | Partition function \(Z = \int e^{-E}\) | CD / NCE / score matching / ratio estimation |
Read together with Lens 2, this gives you a mental decision tree: which corruption is most natural for my data? → which bottleneck am I willing to live with? → family choice falls out.
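The "structured Jacobian" entry for classical flows is easiest to see in an affine coupling layer, the building block popularised by RealNVP. The sketch below is a minimal illustration under stated assumptions: `shift_and_log_scale` is a hypothetical stand-in for the small conditioning network, and the layer acts on plain Python lists rather than tensors.

```python
import math

def shift_and_log_scale(x_a):
    """Hypothetical stand-in for the network that conditions on the untouched half."""
    log_scale = math.tanh(sum(x_a))
    shift = 0.5 * sum(x_a)
    return log_scale, shift

def coupling_forward(x):
    """Affine coupling: copy the first half, affinely transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = shift_and_log_scale(x_a)
    y_b = [xb * math.exp(log_scale) + shift for xb in x_b]
    # The Jacobian is block lower-triangular with diagonal entries 1 (for x_a)
    # and exp(log_scale) (for x_b), so log|det J| is a sum, not an O(D^3) determinant.
    log_det = log_scale * len(x_b)
    return x_a + y_b, log_det

def coupling_inverse(y):
    """Exact inverse: the conditioning half is unchanged, so the parameters are recoverable."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = shift_and_log_scale(y_a)
    x_b = [(yb - shift) * math.exp(-log_scale) for yb in y_b]
    return y_a + x_b

x = [0.3, -0.2, 1.0, 2.0]
y, log_det = coupling_forward(x)
print("y =", y, "log|det J| =", log_det)
print("round trip:", coupling_inverse(y))
```

The design choice to leave half of the input untouched is what buys both the cheap log-determinant and the exact inverse; the price is that many such layers, with permutations in between, are needed for expressiveness.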
A Common Mathematical Framework
Every generative model fits the same pattern: it learns an approximation \(p_\theta(x)\) to a data distribution \(p_{\mathrm{data}}(x)\). The families differ in how they parameterise \(p_\theta\) and how they avoid the otherwise-intractable normalising integral \(Z_\theta = \int \tilde p_\theta(x)\, dx\) of an unnormalised density \(\tilde p_\theta\).
The Fundamental Trilemma
Generative modelling is often framed as a trilemma: of the three desiderata below, a given family has historically delivered at most two (Xiao et al., 2022):
graph TD
Q["The Generative<br/>Trilemma"]
Q --> S["Sample Quality<br/>(perceptual fidelity)"]
Q --> D["Sample Diversity<br/>(mode coverage)"]
Q --> F["Fast Inference<br/>(few NFEs)"]
G["GANs"] -.->|"high quality + fast,<br/>low diversity"| Q
DM["Diffusion"] -.->|"high quality + diverse,<br/>slow"| Q
V["VAEs"] -.->|"diverse + fast,<br/>blurry"| Q
style Q fill:#fff3cd
style G fill:#ffebee
style DM fill:#e1f5ff
style V fill:#e8f5e9
The 2024–2026 wave has weakened the old trilemma rather than erased it. Distilled diffusion reaches 1–4 NFEs at near-teacher quality, Mean Flows report 1-NFE FID 3.43 on ImageNet 256², R3GAN revisits pure-GAN stability and mode coverage, Equilibrium Matching reports EBM FID 1.90 in optimisation-style sampling, and VAR reports AR images at FID 1.73 with much faster inference than its diffusion baselines. Each result breaks a historical stereotype; none removes the need to check data regime, compute budget, metric choice, and failure modes.
Five Strategies to Avoid the Normaliser
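The technical heart of every family is how it deals with the normaliser \(Z = \int \tilde p_\theta(x)\, dx\) of an unnormalised model \(\tilde p_\theta\), where \(p_\theta = \tilde p_\theta / Z\):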
| Strategy | Families | Mechanism |
|---|---|---|
| Factorise \(p_\theta\) so \(Z\) is trivially 1 | Autoregressive, Classical Flows | Chain rule (AR) / change-of-variables with structured Jacobian (flows) |
| Skip \(p_\theta\), learn its gradient instead | Diffusion, Flow Matching, Score-based | Learn \(\nabla_x \log p_t(x)\) or a vector field; sample by ODE/SDE integration |
| Skip \(p_\theta\), learn a 0/1 classifier | GAN, NCE | Train a discriminator / two-sample classifier; implicit density |
| Keep \(Z\) but estimate its gradient | EBM (contrastive divergence) | \(\nabla_\theta \log Z\) is an expectation under \(p_\theta\), estimated with MCMC samples; NCE instead treats \(\log Z\) as a learnable parameter |
| Optimise a lower bound | VAE | ELBO replaces \(\log p_\theta(x)\) with a tractable lower bound |
This taxonomy is more useful than the historical "generative vs discriminative" cut: it tells you immediately what the practical bottleneck of each family is.
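The "keep \(Z\) but estimate its gradient" row follows from a two-line derivation. As an assumption for illustration, write the model as \(p_\theta(x) = e^{-E_\theta(x)} / Z_\theta\) with \(Z_\theta = \int e^{-E_\theta(x)}\, dx\), and assume differentiation and integration can be exchanged:

\[
\nabla_\theta \log Z_\theta
= \frac{1}{Z_\theta} \int \nabla_\theta\, e^{-E_\theta(x)}\, dx
= -\int \frac{e^{-E_\theta(x)}}{Z_\theta}\, \nabla_\theta E_\theta(x)\, dx
= -\,\mathbb{E}_{x \sim p_\theta}\bigl[\nabla_\theta E_\theta(x)\bigr]
\]

Substituting into the negative log-likelihood gives

\[
\nabla_\theta\, \mathbb{E}_{p_{\mathrm{data}}}\bigl[-\log p_\theta(x)\bigr]
= \mathbb{E}_{p_{\mathrm{data}}}\bigl[\nabla_\theta E_\theta(x)\bigr]
- \mathbb{E}_{p_\theta}\bigl[\nabla_\theta E_\theta(x)\bigr]
\]

The first expectation only needs data; the second needs samples from the model itself, which is where MCMC, and its truncation bias, enters.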
Why the Manifold Hypothesis Forces a Family Choice
Real-world data is widely hypothesised to lie near a low-dimensional manifold embedded in a much higher-dimensional ambient space: a \(256 \times 256\) RGB image has \(D \approx 200{,}000\) pixel values, yet intrinsic-dimension estimates for natural-image datasets come out in the tens (Pope et al., 2021). This one observation explains most of the empirical asymmetries between families:
- Classical flows struggle on images because they cannot dimensionally compress: the bijection forces \(D_{\mathrm{model}} = D_{\mathrm{data}}\), wasting capacity on the orthogonal-to-manifold directions.
- Latent diffusion / latent flow matching dominate production because the VAE codec first compresses the data into a much smaller latent that sits closer to the manifold's intrinsic dimension, and the generative model then works on that compressed representation.
- GANs trained on natural images implicitly learn the manifold but the JS divergence has no gradient between two distributions whose supports don't overlap — exactly the pathological case for low-dimensional manifolds in high-dimensional ambient spaces.
- Diffusion sidesteps the support-mismatch problem because the noised marginals \(q(x_t)\) are Gaussian-smoothed versions of the data distribution and therefore have full support at every \(t > 0\).
- EBMs (classical) have no built-in mechanism to express low-dimensional support; Equilibrium Matching is explicitly designed so the learned gradient field has equilibria on the data manifold and grows away from it, giving a more manifold-aware energy landscape.
This is why in 2026 a dominant visual-generation recipe is codec + diffusion / flow matching: compress first, then learn the generative dynamics in a lower-dimensional latent where the geometry is better conditioned. It is a powerful recipe, not a theorem that all modalities should use a VAE codec.
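The support-mismatch point in the list above can be made concrete with two point masses on a 1-D grid, a deliberately extreme stand-in for two non-overlapping manifolds. In this sketch (helper names are hypothetical), the Jensen-Shannon divergence is stuck at \(\log 2\) however far apart the masses are, so it carries no signal about distance, while Wasserstein-1 grows linearly with the gap.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(p, q, spacing=1.0):
    """W1 on an evenly spaced 1-D grid: the summed absolute gap between the two CDFs."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(index, size=50):
    return [1.0 if i == index else 0.0 for i in range(size)]

for gap in (1, 5, 20):
    p, q = point_mass(0), point_mass(gap)
    print(f"gap={gap:2d}  JS={js_divergence(p, q):.4f}  W1={wasserstein1(p, q):.1f}")
```

A generator trained against the JS value alone would see the same loss at every gap, which is the flat-gradient pathology; the W1 column is the behaviour the WGAN row in Lens 1 relies on.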
The Probabilistic Graphical Model View
Each family corresponds to a graphical-model structure — a way of factoring the joint distribution over data and latents. This is the oldest unifying lens in machine learning and remains the cleanest:
| Family | Graphical structure | Joint factorisation |
|---|---|---|
| AR | DAG (directed chain) | \(\prod_{i=1}^{N} p(x_i \mid x_{<i})\), with \(p(x_1 \mid x_{<1}) = p(x_1)\) |
| VAE | Directed latent-variable model | \(p(z) \, p_\theta(x \mid z)\) with amortised inference \(q_\phi(z \mid x)\) |
| Flow (classical) | Deterministic invertible chain | \(p(z) \cdot \prod_k \lvert\det J_{f_k}\rvert^{-1}\) |
| Diffusion | Markov chain (length \(T\)) | \(p(x_T) \prod_t p_\theta(x_{t-1} \mid x_t)\) |
| EBM | Undirected (Markov random field) | \(\frac{1}{Z}\exp(-E_\theta(x))\) — no factorisation, just a global potential |
| GAN | None — implicit | No tractable joint; only a sampling procedure |
| JEM | Undirected joint over \((x, y)\) | \(\frac{1}{Z}\exp(-E_\theta(x, y))\) |
Key insight from the graphical view: directed models (AR, classical flows, and, up to the ELBO bound, VAEs) get a tractable likelihood or bound almost for free, but constrain the conditional independence structure they can express. Undirected models (EBMs) can express any joint over \(x\) but pay for it with the partition function. Diffusion is the rare case of a directed model with an MRF-like global structure: the chain \(p(x_T) \prod_t p_\theta(x_{t-1} \mid x_t)\) is directed, but at every \(t\) a true score field can be written as the gradient of a log-density, equivalently the negative gradient of a time-indexed energy \(E(x, t) = -\log p_t(x) + \mathrm{const}\). This is the precise sense in which score-based diffusion sits near EBMs in the Modern Unifications.
This view also explains a counter-intuitive fact: adding more latent variables makes likelihood harder to compute, not easier, because each latent introduces a marginalisation. VAEs side-step this with the ELBO; diffusion side-steps it by making the latents structured and tractable; classical flows side-step it by making the chain deterministic so there are no latents to marginalise over.
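The "tractable likelihood almost for free" claim for directed chains can be checked on a toy model. The sketch below uses a hypothetical three-token bigram model, which is a Markov simplification of the general \(p(x_i \mid x_{<i})\) factorisation; the probability tables are made up for illustration.

```python
import itertools
import math

VOCAB = ("a", "b", "c")
# Hypothetical probability tables; each row sums to 1.
FIRST = {"a": 0.5, "b": 0.3, "c": 0.2}
NEXT = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.4, "b": 0.4, "c": 0.2},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}

def log_likelihood(seq):
    """Exact log p(seq) via the chain rule: log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    logp = math.log(FIRST[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(NEXT[prev][cur])
    return logp

# Each conditional is already normalised, so the joint sums to 1 over all
# sequences of a given length without ever computing a partition function.
TOTAL = sum(math.exp(log_likelihood(s)) for s in itertools.product(VOCAB, repeat=4))

print("log p(a, b) =", log_likelihood(("a", "b")))
print("sum over all length-4 sequences =", TOTAL)
```

The same argument scales to neural conditionals: as long as every step ends in a softmax, the product is normalised by construction, which is the sense in which \(Z\) is "trivially 1" for autoregressive models.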
The Family Tree
Generative-model families share more ancestry than the names suggest. The diagram below traces the lineage from classical statistical modelling to the 2024–2026 unified frameworks.
graph TD
classical["Classical statistics<br/>(MLE, Boltzmann, MCMC)"]
classical --> ebm["Energy-Based Models<br/>(Hopfield, RBM, deep EBMs)"]
classical --> ar["Autoregressive<br/>(n-grams → RNN → Transformer → MoE)"]
classical --> flow["Normalizing Flows<br/>(NICE, RealNVP, Glow, NSF)"]
ebm --> sm["Score Matching<br/>(Hyvärinen 2005)"]
sm --> dsm["Denoising Score Matching<br/>(Vincent 2011)"]
dsm --> ncsn["NCSN / Score-SDE<br/>(Song & Ermon 2019/21)"]
ncsn --> diff["Diffusion Models<br/>(Sohl-Dickstein 2015,<br/>Ho et al. 2020)"]
flow --> cnf["Continuous Flows<br/>(Neural ODEs 2018,<br/>FFJORD 2018)"]
cnf --> fm["Flow Matching<br/>(Lipman 2023,<br/>Liu 2023, Albergo 2023)"]
diff -.->|"Gaussian path<br/>shared core"| fm
vae_root["Auto-Encoders<br/>(1980s)"] --> vae["VAE<br/>(Kingma & Welling 2014,<br/>Rezende et al. 2014)"]
classical --> vae_root
vae --> ldm["Latent Diffusion<br/>(Rombach 2022 = SD)"]
diff --> ldm
gan_root["Game theory /<br/>Density-ratio estimation"] --> gan["GANs<br/>(Goodfellow 2014)"]
gan --> add["Adversarial Diffusion<br/>Distillation (ADD, DMD2)"]
diff --> add
fm -. "Energy Matching<br/>Equilibrium Matching" .- ebm
ar -. "Wang et al. 2026<br/>bijection" .- ebm
ar --> var["VAR — Visual AR<br/>(Tian 2024 NeurIPS Best)"]
var -.-> diff
style classical fill:#fff9c4
style ebm fill:#e1f5ff
style ar fill:#fff3cd
style flow fill:#f3e5f5
style diff fill:#e8eaf6
style fm fill:#e8eaf6
style vae fill:#e8f5e9
style gan fill:#ffebee
style ldm fill:#e8eaf6
style add fill:#ffebee
style var fill:#fff3cd
The dotted lines are 2022–2026 unifications — pairs of families that share an objective, probability path, dual representation, or training trick under specific assumptions. The caveats matter: Gaussian-path diffusion is close to flow matching; arbitrary flow matching is broader. Score diffusion can be read as an EBM when the learned field is a score; an unconstrained vector field need not be conservative. ARMs and EBMs are linked in function space, but their sampling algorithms and engineering bottlenecks remain different.
Training Objectives Compared
The single most informative comparison across families is what each one minimises:
| Family | Objective | One-line summary |
|---|---|---|
| AR | \(-\sum_i \log p_\theta(x_i \mid x_{<i})\) | Cross-entropy on next-token prediction |
| VAE | \(-\mathbb{E}_q[\log p(x\mid z)] + D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z))\) | Reconstruction + KL (the negative ELBO) |
| GAN (orig) | \(\min_G \max_D \mathbb{E}_{p_d}[\log D] + \mathbb{E}_{p_g}[\log(1-D)]\) | Saddle-point on the JS divergence |
| GAN (WGAN) | \(\min_G \max_{D \in \mathrm{Lip}_1} \mathbb{E}_{p_d}[D] - \mathbb{E}_{p_g}[D]\) | Saddle-point on the Wasserstein-1 distance |
| Diffusion (DDPM) | \(\mathbb{E}_{t, x_0, \epsilon}\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\) | MSE between predicted and actual noise |
| Score-Based | \(\mathbb{E}_{p(x_t)}\,\|\nabla \log p_t - s_\theta(x_t, t)\|^2\) | Match the score of the noisy distribution |
| Flow Matching | \(\mathbb{E}_{t, x_0, x_1}\,\|u_t(x_t \mid x_0, x_1) - v_\theta(x_t, t)\|^2\) | Regress against a conditional target vector field (for the linear path, \(x_1 - x_0\)) |
| Classical Flow | \(-\log p_z(f^{-1}(x)) - \log\,\lvert\det J_{f^{-1}}\rvert\) | Maximum likelihood + Jacobian determinant |
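| EBM (CD) | \(\mathbb{E}_{x^+ \sim p_d}[E_\theta(x^+)] - \mathbb{E}_{x^- \sim \mathrm{MCMC}}[E_\theta(x^-)]\) (surrogate whose \(\theta\)-gradient is the CD update) | Push energy down on data, up on MCMC samples |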
| EBM (SM) | \(\mathbb{E}_{p_d}\bigl[-\mathrm{tr}(\nabla^2 E_\theta) + \tfrac{1}{2}\|\nabla E_\theta\|^2\bigr]\) | Match the score implicitly (Hyvärinen 2005) |
| EBM (NCE) | \(-\mathbb{E}_{p_d}[\log h_\theta] - \mathbb{E}_{p_n}[\log(1-h_\theta)]\) | Binary classify data vs noise |
| Energy Matching | OT vector-field loss far from data + entropic energy near data | Single scalar potential bridges flows and EBMs |
Three patterns jump out:
- MSE-against-something is the modern default. Diffusion, score-based, flow matching, and Energy Matching all reduce training to L2 regression on a target derived from a fixed corruption process — explaining why these methods are so much more stable than classical EBMs or GANs.
- Adversarial / contrastive losses are still useful, but increasingly as a post-training tool rather than the main objective. ADD and DMD2 put a learned discriminator-like critic on top of an already-good base model; contrastive divergence for EBMs is the main exception, where the contrast against model samples is the training objective itself.
- Maximum likelihood is back at the top of the stack for any family that can manage it — autoregressive LLMs, flows, and (via the ELBO) VAEs all train this way and benefit from the regularising effects.
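To make one row of the objectives table concrete, here is a minimal single-sample estimate of the VAE objective (reconstruction plus KL). It is a sketch under assumptions the page does not state: a diagonal-Gaussian encoder, a standard-normal prior, and a unit-variance Gaussian decoder so that reconstruction reduces to half the squared error up to a constant. The `decode` argument is a hypothetical callable standing in for the decoder network.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def negative_elbo(x, mu, log_var, decode, eps):
    """Single-sample negative ELBO, up to an additive constant from the decoder variance."""
    # Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I) supplied by the caller.
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    x_hat = decode(z)
    reconstruction = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return reconstruction + gaussian_kl_to_standard_normal(mu, log_var)

# Usage with an identity "decoder" and a fixed noise draw, purely for illustration.
loss = negative_elbo(
    x=[1.0], mu=[1.0], log_var=[0.0], decode=lambda z: z, eps=[0.0]
)
print("negative ELBO:", loss)
```

In this toy call the reconstruction term is zero and the whole loss is the KL cost of moving the posterior mean away from the prior, which is the tension that posterior collapse resolves in the wrong direction.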
Inference / Sampling Compared
| Family | How a single sample is drawn | NFEs (typical) | Determinism |
|---|---|---|---|
| VAE | \(z \sim p(z)\), then \(x = \mathrm{Decoder}(z)\) | 1 | Stochastic (in \(z\)) |
| GAN | \(z \sim p(z)\), then \(x = G(z)\) | 1 | Stochastic |
| Classical Flow | \(z \sim p_z\), then \(x = f(z)\) | 1 | Stochastic |
| Flow Matching | Solve \(\dot x = v_\theta(x, t)\) from \(t=0\) to \(t=1\) | 10–50 (or 1 distilled / Mean Flow) | Deterministic ODE |
| Diffusion (SDE) | Reverse SDE / Langevin chain from \(x_T \sim \mathcal{N}(0, I)\) | 20–1000 | Stochastic |
| Diffusion (PF-ODE) | Probability-flow ODE | 10–50 (or 1–4 distilled) | Deterministic |
| EBM (classical) | Langevin / SGLD from random init | 50–1000 | Stochastic |
| EBM (Equilibrium Matching) | Adam-style gradient descent on the energy | 5–50 | Mostly deterministic |
| AR (greedy) | \(\mathrm{argmax}_{x_t} p(x_t\mid x_{<t})\), repeat \(N\) times | \(N\) tokens | Deterministic |
| AR (sampled) | Top-k / top-p / temperature, repeat \(N\) times | \(N\) tokens | Stochastic |
| AR (MTP / speculative) | Multi-token / draft-and-verify | \(\approx N/3\)–\(N/6\) | Same outputs, faster |
This table makes the modern reality precise: many modern generators trade one huge pass for iterative refinement in some space — diffusion refines \(x_t\) via a score / denoiser, EBMs move \(x\) via an energy gradient, flow matching integrates a velocity field, and autoregression refines a prefix one or more positions at a time. The similarity is useful for thinking about test-time compute; it does not make the algorithms interchangeable.
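The "integrate a velocity field" row can be sketched with plain Euler steps in a case where the exact answer is known. The setup below is an assumption chosen for tractability: a 1-D source \(\mathcal{N}(0, 1)\), a hypothetical target \(\mathcal{N}(2, 0.5^2)\), the linear path \(x_t = (1-t)\,x_0 + t\,x_1\) with independent endpoints, and the closed-form marginal velocity \(\mathbb{E}[x_1 - x_0 \mid x_t]\) used in place of a trained network \(v_\theta\).

```python
MEAN, STD = 2.0, 0.5   # hypothetical target N(MEAN, STD^2); the source is N(0, 1)

def velocity(x, t):
    """Closed-form marginal velocity E[x1 - x0 | x_t = x] for independent Gaussian endpoints."""
    var_t = (1.0 - t) ** 2 + (t * STD) ** 2    # Var[x_t]
    cov = t * STD ** 2 - (1.0 - t)             # Cov[x1 - x0, x_t]
    return MEAN + cov / var_t * (x - t * MEAN)

def euler_sample(x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, h = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += h * velocity(x, k * h)
    return x

x0 = 1.5
exact = MEAN + STD * x0          # the ODE maps x0 to MEAN + STD * x0 in this Gaussian case
coarse = euler_sample(x0, 10)    # 10 function evaluations
fine = euler_sample(x0, 1000)    # 1000 function evaluations

print("exact:", exact, " 10 steps:", coarse, " 1000 steps:", fine)
```

The gap between the coarse and fine results is the discretisation error that few-step distillation and straighter paths aim to remove; each Euler step here corresponds to one NFE in the table.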
Modern Unifications (2022–2026)
The single most important development of the last four years is not that the families literally disappeared. It is that many boundaries became change-of-variable boundaries: the same probability path, score, energy, or latent representation can often be written in another family's language. Eight key results:
1. Score-based diffusion as time-indexed EBMs
When the learned field is a score, \(s_\theta(x, t) \approx \nabla_x \log p_t(x)\), it can be written as \(s_\theta(x, t) = -\nabla_x E_\theta(x, t)\) for a time-indexed energy up to an additive constant. The probability-flow ODE perspective (Song et al., 2021) makes the density / score connection explicit. Caveat: an arbitrary neural vector field is not automatically conservative; the EBM reading is cleanest for true scores or architectures / objectives that enforce integrability. → See diffusion-explained.md and ebm-explained.md.
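At a single fixed noise level, the identity \(s = -\nabla_x E\) means a score model can drive the same Langevin sampler an EBM would use. The sketch below illustrates this on a toy energy \(E(x) = x^2 / 2\), whose density is \(\mathcal{N}(0, 1)\); the energy, step size, and chain length are assumptions for illustration, and unadjusted Langevin carries a small step-size bias.

```python
import math
import random

def energy_grad(x):
    """Gradient of the toy energy E(x) = x^2 / 2; the score is its negative, -x."""
    return x

def langevin_sample(rng, num_steps=400, step_size=0.02, x_init=3.0):
    """Unadjusted Langevin: x <- x - step * grad E(x) + sqrt(2 * step) * noise."""
    x = x_init
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(num_steps):
        x = x - step_size * energy_grad(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(2000)]   # independent chains
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print("sample mean:", mean, " sample variance:", var)
```

Swapping `energy_grad` for the negative of a learned score at a given noise level gives the classic annealed-Langevin sampler; the caveat in the text applies, since a learned vector field that is not conservative has no underlying energy for the chain to equilibrate to.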
2. Diffusion and Gaussian-path Flow Matching share an algorithmic core
Diffusion Meets Flow Matching, Flow Matching for Generative Modeling, and later theory show that common DDPM / score-SDE objectives and Gaussian-path flow matching can be written as closely related probability-path algorithms. They differ in schedule, target parameterisation, stochastic vs ODE sampling, and whether the chosen path is Gaussian / OT / rectified / discrete. This is why SD3 and FLUX.1 use rectified-flow training while still being called "diffusion" colloquially. → See flow-explained.md and diffusion-explained.md.
3. Latent Diffusion = VAE codec + diffusion prior
Stable Diffusion's recipe (Rombach et al., 2022) is "compress with an autoencoder, run diffusion in the latent". This remains a dominant production recipe for image generation and many video systems; Sora's public technical report describes compressed spacetime patch latents but not enough implementation detail to identify its exact codec. Modern image and video tokenizers are therefore a central part of the generative stack, not a preprocessing afterthought (VAE explainer's tokenizer-VAE section).
4. Adversarial Distillation = GAN + Diffusion
ADD / SDXL Turbo (Sauer et al., 2023) and DMD2 (Yin et al., 2024 NeurIPS Oral) put a GAN-style discriminator on top of a pretrained diffusion teacher to compress 50-step samplers down to 1–4 steps. The discriminator is a major component of several fast diffusion pipelines, alongside non-adversarial distillation families such as LCM and consistency models. → See gan-explained.md.
5. Energy Matching = Flow Matching + EBM
Energy Matching (Balcerak et al., NeurIPS 2025) parameterises a single scalar potential whose gradient is an OT vector field far from data and a Boltzmann equilibrium near data — unifying flow matching and EBMs in one objective. Equilibrium Matching (Wang et al., 2025) takes this further: time-invariant energy gradient, Adam-style sampling, FID 1.90 on ImageNet 256². → See ebm-explained.md.
6. Autoregressive ↔ Energy-Based bijection (2026)
Wang et al., 2026 (ARMs are Secretly EBMs) establish an explicit bijection between autoregressive LMs and EBMs in function space, corresponding to a special case of the soft Bellman equation in maximum-entropy reinforcement learning. One practical consequence: RLHF and DPO can be read as energy shaping. → See ebm-explained.md and autoregressive-explained.md.
7. Diffusion = Hierarchical VAE
DDPM is mathematically a hierarchical VAE with \(T\) latent levels and a fixed Gaussian forward process (Luo, 2022 — Understanding Diffusion Models). The "ELBO derivation" of the DDPM loss in the diffusion explainer makes this exact.
8. AR ↔ Diffusion (text and images)
Discrete diffusion language models — LLaDA (Nie et al., 2025), Mercury (Inception, 2025) — are competitive in selected 8B-scale language and coding settings while decoding many tokens in parallel, but the broad AR-vs-diffusion language comparison is still early. Conversely, VAR (Tian et al., 2024 — NeurIPS Best Paper) brings GPT-style next-scale autoregression to image generation, beating its DiT baselines on ImageNet 256². Generator Matching (Holderrieth et al., 2024) gives a single objective subsuming flow / score / discrete-flow training.
The 2026 picture is best summarised as a mental model: choose a representation, choose a corruption / transport path, choose the target your network predicts, then choose a sampler and post-training procedure. The historical family name is often shorthand for those choices, not a completely separate universe.
Conditional Generation: How Each Family Handles Control
A pure generative model samples from \(p_\theta(x)\). Almost every useful deployed system samples from \(p_\theta(x \mid c)\) — conditioned on text, class, image, audio, or arbitrary control signals. The mechanisms differ across families and reveal a deeper unification:
| Family | Native conditioning | Test-time control trick |
|---|---|---|
| AR | Prepend \(c\) to the input sequence; train on \(p_\theta(x_t \mid c, x_{<t})\) | Prompting; classifier-free guidance (Sanchez et al., 2024); CCA (Toward Guidance-Free AR Visual Generation) |
| VAE | Conditional VAE (Sohn et al., 2015): \(p(x \mid z, c)\) and \(q(z \mid x, c)\) | Latent-space attribute editing; FiLM modulation |
| Diffusion | Train \(\epsilon_\theta(x_t, t, c)\) with cross-attention to \(c\) | Classifier-free guidance (Ho & Salimans, 2022); CFG-Zero* (Fan et al., 2025); ControlNet, IP-Adapter |
| Flow Matching | Train \(v_\theta(x_t, t, c)\) in the same way as conditional diffusion | CFG, CFG-Zero* (transferred from diffusion) |
| EBM | \(E_\theta(x, c)\) as a single energy network | Energy + reward shaping (cf. RLHF); JEM joint formulation |
| GAN | Concatenate \(c\) at input or condition the discriminator (cGAN, Pix2Pix) | Truncation, latent arithmetic |
| Classical Flow | Concatenate \(c\) to coupling networks; auxiliary classifier head | Limited; this is a known weakness of classical flows |
Three Universal Ideas
Three conditioning mechanisms have spread across all families:
- Classifier guidance (Dhariwal & Nichol, 2021) — push samples toward higher \(p_\phi(c\mid x)\) via the gradient of an auxiliary classifier. Originally for diffusion; now also used for AR and EBMs (where it's essentially negative-energy reward shaping).
- Classifier-free guidance (CFG) (Ho & Salimans, 2022): combine conditional and unconditional model outputs, typically extrapolating past the conditional output in the direction away from the unconditional one. Originally for diffusion; now also for AR visual generation (Sanchez et al., 2024) and flow matching, with CFG-Zero* (Fan et al., 2025) as a recent variant. That the same recipe carries over from diffusion to AR supports the Modern Unifications view: once both are written as denoising-of-a-corruption, CFG becomes a single technique with a shared form.
- Adapter-based conditioning — ControlNet (Zhang et al., 2023) for diffusion, LoRA (Hu et al., 2021) for AR, IP-Adapter / T2I-Adapter for diffusion. The pattern: keep the base model frozen and inject conditioning through small trainable side-modules. This is now the dominant deployment recipe for any conditioning that wasn't in the pre-training distribution.
CFG: A Cross-Family Technique¤
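CFG started as a diffusion-specific trick but turned out to be remarkably general. The unified statement is:
\[
\tilde{o}_\theta(x, c) = o_\theta(x, \varnothing) + w\,\bigl(o_\theta(x, c) - o_\theta(x, \varnothing)\bigr)
\]
where \(o\) is whatever the network outputs (noise for diffusion, velocity for flow matching, logits for AR, or a score), \(\varnothing\) denotes the dropped (null) condition, and \(w\) is the guidance scale, with \(w = 1\) recovering the purely conditional output. This is the same algebra in every family. Recent results (Sanchez et al., 2024; EditAR, CVPR 2025) extend CFG to AR visual generation and image editing under a unified framework.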
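As a minimal sketch of that shared algebra, assuming the common convention in which a guidance weight `w` of 0 gives the unconditional output and 1 gives the conditional one. The function name `cfg_combine` and the toy arrays are illustrative and not taken from any library:

```python
import numpy as np

def cfg_combine(o_cond, o_uncond, w):
    """Extrapolate from the unconditional output toward the conditional one."""
    return o_uncond + w * (o_cond - o_uncond)

# Toy network outputs: these could be predicted noise, a velocity, or logits.
o_uncond = np.array([0.0, 1.0, 2.0])
o_cond = np.array([1.0, 1.0, 0.0])

print(cfg_combine(o_cond, o_uncond, 0.0))  # equals the unconditional output
print(cfg_combine(o_cond, o_uncond, 1.0))  # equals the conditional output
print(cfg_combine(o_cond, o_uncond, 3.0))  # w > 1 pushes past the conditional output
```

The same function applies unchanged whichever family produced the two outputs; only the meaning of the arrays differs.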
The open theoretical question (cf. Open Problems → Why does CFG work?): CFG samples from a distribution that's neither \(p_\theta(x \mid c)\) nor \(p_{\mathrm{data}}(x \mid c)\) — it's an interpolation whose target divergence isn't formally characterised. The fact that it improves perceptual quality (and CLIP score) while increasing divergence from the true conditional is one of the deepest unresolved puzzles in the field.
Failure Modes Across Families¤
Each family has characteristic failure modes that practitioners spend most of their debugging time on. The cross-family view reveals deep parallels:
| Failure | Family | Symptom | Mechanism | Fix |
|---|---|---|---|---|
| Mode collapse | GAN | Generator outputs few distinct samples | Reverse-KL pulls mass away from low-density regions | WGAN; mini-batch discrimination; PacGAN; spectral norm; modern recipes (R3GAN) almost eliminate it |
| Posterior collapse | VAE | Decoder ignores \(z\); KL = 0 | Strong decoder makes \(z\) uninformative | KL annealing; cyclical β-schedule; free-bits; VQ-VAE (discrete latents) |
| Exposure bias | AR | Test-time errors compound | Trained on ground-truth context, sampled from own context | Scheduled sampling; sufficient-scale-and-data is the practical fix |
| Partition-function bias | EBM (CD) | Short-chain MCMC under-estimates \(\nabla \log Z\) | Negative samples not from \(p_\theta\) | Persistent CD; replay buffers; switch to NCE / score matching / Energy Matching |
| Energy explosion | EBM | Energies grow unbounded; NaNs | No regulariser on \(E\) | Add \(\alpha \mathbb{E}[E^2]\); gradient clipping; spectral norm |
| Likelihood ≠ quality | Flow / EBM / VAE | High likelihood, ugly samples (or vice versa) | Likelihood is not a perceptual metric | Combine with perceptual losses; use a tokenizer + downstream prior |
| Higher likelihood on OOD | Flow / VAE | Pathological assignment of high \(p_\theta\) to non-data | Subtle training-set bias; not an algorithmic bug | Likelihood-ratio scoring; second-model normalisation (Nalisnick et al. 2019) |
| Schedule mismatch | Diffusion | Last-step blur or first-step incoherence | Sub-optimal noise schedule for the data distribution | Cosine / shifted / EDM schedules; per-resolution tuning |
| CFG over-saturation | Diffusion | High-CFG samples are over-saturated, oversmooth | CFG pushes off-manifold | CFG-Zero*; capping the guidance scale; dynamic thresholding |
| Bias amplification | All | Outputs amplify biases in training data | Generative models are statistical mirrors | Data curation; debiasing fine-tunes; reward-model post-training |
| Recursive model collapse | AR / Diffusion (any) | Quality declines when trained on prior model's outputs | Iterated narrowing of the support of generated data | Mix in real data (Shumailov et al., 2024 — Nature); synthetic-data verification (2025) |
| Hallucination | AR LLMs | Confident wrong outputs | No mechanism distinguishing memorised facts from interpolations | RAG; verifier-based RLVR; calibration training |
| Privacy leakage | All | Model regurgitates training examples | High capacity + low data diversity | Differential privacy; deduplication; data unlearning |
The deeper observation: almost every failure mode in the first column has a structural analogue in another family. Mode collapse (GAN) is driven by adversarial support mismatch, critic pressure, and update dynamics; posterior collapse (VAE) is KL-pressure-driven; recursive model collapse (any family) is a feedback-loop pathology. Yet all three can be read as insufficient pressure on \(p_\theta\) to keep its support equal to that of \(p_{\mathrm{data}}\). A frontier research thread: a single regularisation that addresses all of them simultaneously.
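One of the fixes in the table, the \(\alpha \mathbb{E}[E^2]\) regulariser for energy explosion, is simple enough to sketch. The following is a hedged illustration only: the function name, the default `alpha`, and the toy energies are assumptions, and a real EBM would compute the energies with a network and obtain negatives from MCMC.

```python
import numpy as np

def cd_loss_with_energy_reg(e_pos, e_neg, alpha=0.1):
    """Contrastive-divergence-style loss plus an L2 penalty on energy magnitudes.

    e_pos: energies of real samples; e_neg: energies of (assumed) MCMC negatives.
    """
    cd_term = e_pos.mean() - e_neg.mean()                        # lower data energy, raise negative energy
    reg_term = alpha * (np.mean(e_pos**2) + np.mean(e_neg**2))   # alpha * E[E^2]
    return cd_term + reg_term

e_pos = np.array([1.0, -1.0])
e_neg = np.array([2.0, 0.0])

small = cd_loss_with_energy_reg(e_pos, e_neg)
large = cd_loss_with_energy_reg(100.0 * e_pos, 100.0 * e_neg)
print(small, large)  # the penalty dominates once energies grow large
```

Without the penalty, scaling all energies by 100 would simply scale the loss and keep rewarding larger gaps; with it, the quadratic term grows faster than the contrastive term.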
The Platonic Representation Hypothesis¤
A startling 2024–2025 empirical observation: representations across different model families often become more aligned as scale and task diversity increase. Models trained with different objectives — CLIP-style contrastive vision-language encoders, autoregressive LLMs, diffusion / VAE visual codecs, and DINOv2-style self-supervised encoders — can end up measuring neighbourhoods between datapoints in surprisingly similar ways.
This is the Platonic Representation Hypothesis (Huh, Cheung, Wang & Isola, 2024): sufficiently capable models may converge toward a shared statistical model of the underlying reality. Three driving forces are proposed:
- Task generality — models trained on more diverse tasks are forced into representations that capture more of the underlying structure.
- Model capacity — larger models are more likely to find these globally-optimal representations.
- Simplicity bias — neural networks naturally prefer simpler solutions that generalise.
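The phrase "measuring neighbourhoods in similar ways" can be made concrete with a mutual nearest-neighbour style score. The sketch below is an illustration under stated assumptions, not the exact metric of any paper: the function names are hypothetical, cosine similarity is an arbitrary choice, and random arrays stand in for features from two models evaluated on the same datapoints.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbours of each row under cosine similarity."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # a point is not its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(feats_a, feats_b, k=5):
    """Mean fraction of shared k-nearest neighbours between two feature spaces."""
    nn_a, nn_b = knn_indices(feats_a, k), knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a.tolist(), nn_b.tolist())]
    return float(np.mean(overlap))

rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
unrelated = rng.normal(size=(60, 16))

same_geometry = mutual_knn_alignment(feats, feats @ rotation)
independent = mutual_knn_alignment(feats, unrelated)
print(same_geometry, independent)
```

Two feature spaces that differ only by a rotation score near 1 because neighbourhoods are preserved, while independent features score near chance. The score is purely local, which is worth keeping in mind when comparing it with global similarity measures.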
Why This Matters for Generative Modelling¤
If the hypothesis holds even approximately, the choice of family matters less than the field's history suggests:
- A diffusion DiT and an AR transformer trained on the same data end up with qualitatively similar internal feature manifolds.
- REPA (Yu et al., 2024 — ICLR'25 Oral) exploits this directly — aligning DiT activations with frozen DINOv2 features yields 17.5× faster training with no quality loss. The "right" representation already exists; you just have to inherit it.
- VFM-tokenizers (2025) take this to its logical conclusion: replace the trained VAE encoder with a frozen DINOv2 / SigLIP, train only a lightweight decoder. The codec inherits the Platonic representation.
- 2025 follow-ups extend the hypothesis to molecular ML (blopig 2026), astronomy (The Platonic Universe, 2025), and neural decoding (Brain–LM Alignment, NeurIPS 2025 UniReps) — evidence that representation alignment is not confined to language and vision.
- A 2026 critique, Revisiting the Platonic Representation Hypothesis: An Aristotelian View, shows that some global similarity metrics are confounded by model scale. After calibration, the surviving signal is more local: models agree on neighbourhood structure more reliably than on a single universal global geometry.
The implication for novel research: representational structure is a powerful prior, but it is not magic. A research direction that reuses strong representations across generative-model families (AR planner with diffusion-tokenized world; flow-matching decoder hung on a frozen VFM / LLM encoder) is likely under-explored, but it still needs calibrated evidence that the transferred representation preserves the task-relevant neighbourhoods.
Scaling Laws Across Families¤
Each family has its own scaling law for how loss decreases with compute, parameters, and data. Comparing them reveals where each family has head-room left.
Autoregressive: Chinchilla and Beyond¤
Hoffmann et al., 2022 (Chinchilla) showed that for compute-optimal training, model size and training tokens should scale equally: for every doubling of model size, double the training tokens. This corrected Kaplan-style "scale parameters first" recipes that had left Gopher-class models under-trained, with too many parameters for the number of tokens they saw.
2024–2026 refinements:
- Beyond Chinchilla-Optimal (2024): when you expect ≥1B inference requests, train a smaller model for longer than the Chinchilla-optimal point, paying compute upfront for inference savings.
- LFM2.5 (Liquid AI, 2026) sets an 80,000:1 token-to-parameter record (350M parameters trained on 28T tokens), suggesting that the diminishing-returns curve has not yet saturated.
- Test-time compute scaling (Snell et al., 2024; OpenAI o1, 2024) introduces an entirely new axis: throwing inference compute at hard problems is often more cost-effective than scaling parameters. This is sometimes called the "missing piece of the Bitter Lesson" — Rich Sutton's 2019 essay argued that general methods that leverage compute are the only ones that scale; until 2024 we had this only for training compute. Test-time compute completes the picture.
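The equal-scaling rule and the token-to-parameter ratios above reduce to a few lines of arithmetic. This sketch assumes two widely used rules of thumb that are not stated in this document: training compute \(C \approx 6ND\) FLOPs, and roughly 20 training tokens per parameter at the compute-optimal point. The function name is hypothetical.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and tokens D.

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N (both rules of thumb).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_allocation(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")

# Token-to-parameter ratio for the 350M-parameter / 28T-token figures quoted above.
lfm_ratio = 28e12 / 350e6
print(lfm_ratio)
```

Under these assumptions, quadrupling the budget doubles both `n` and `d`, which is the "scale equally" statement; the second ratio shows how far beyond the roughly 20:1 point inference-optimised training can go.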
Diffusion: EDM and EDM2¤
Karras et al., 2022 (EDM) and Karras et al., 2024 (EDM2) gave the diffusion field its first clean compute-optimal scaling laws. Notably, post-hoc EMA lets many EMA-decay settings be evaluated after training without re-training, and EDM2's magnitude-preserving design reaches FID 1.81 on ImageNet 512², a record that held for over a year.
REPA-style representation alignment (Yu et al., 2024) shifts the scaling curve outright: 17.5× faster SiT-XL training to matching FID. This is the diffusion analogue of "switching from Kaplan to Chinchilla scaling": the new compute-optimal curve passes through much lower compute budgets if you align with foundation-encoder representations.
Sparse / Mixture-of-Experts¤
MoE breaks the dense-scaling curve by letting parameters and compute scale separately. DeepSeek-V3 (671B total / 37B active) and Mixtral 8×7B demonstrate that on standard benchmarks, an MoE with \(N_{\mathrm{active}} \approx N_{\mathrm{dense}}/4\) matches dense quality at near-equivalent inference cost. The 2026 default flagship is "sparse-activated MoE Transformer" — see autoregressive-explained.md → MoE.
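To illustrate how parameters and compute decouple, here is a minimal top-k routing sketch. It is a toy under stated assumptions: experts are single linear maps, the router is one matrix, there is no load balancing, and all names are hypothetical rather than drawn from DeepSeek-V3 or Mixtral.

```python
import numpy as np

def top_k_moe(x, gate_w, expert_ws, k=2):
    """x: (tokens, d); gate_w: (d, E); expert_ws: (E, d, d)."""
    logits = x @ gate_w                                   # router scores, shape (tokens, E)
    top = np.argsort(-logits, axis=1)[:, :k]              # the k highest-scoring experts per token
    top_logits = np.take_along_axis(logits, top, axis=1)
    weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over the selected experts only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):                                # only k of E experts run per token
            out[t] += weights[t, j] * (x[t] @ expert_ws[top[t, j]])
    return out, top

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
gate_w = rng.normal(size=(3, 8))
expert_ws = rng.normal(size=(8, 3, 3))

out, chosen = top_k_moe(x, gate_w, expert_ws, k=2)
print(out.shape, chosen)
```

With 8 experts and `k=2`, every token touches a quarter of the expert parameters, so total parameter count can grow with the number of experts while per-token compute stays tied to `k`.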
A Unified Scaling Picture¤
| Family | Compute axis | Data axis | Test-time axis |
|---|---|---|---|
| AR | Chinchilla-optimal \(N \times T\) | 28T+ tokens (LFM2.5) | CoT length, sample-and-verify (o1, R1, o3) |
| Diffusion | EDM2 / REPA accelerate the curve | ImageNet → LAION 5B → web-scale video | Distillation steps (1–4) |
| Flow matching | Mean Flow / TarFlow shift the curve to 1 NFE | Same as diffusion | Reflow iterations |
| EBM | Energy Matching / EqM finally got a stable curve | Smaller-scale (CIFAR / ImageNet) than AR | Adaptive sampling steps |
| GAN | R3GAN simplifies; GigaGAN scales to 1B params | LAION-class | N/A (one NFE) |
The deeper observation: by 2026 nearly every family has discovered a test-time compute axis. Diffusion has its sampler step count, AR has chain-of-thought, EBMs have adaptive-optimiser sampling, flows have reflow iterations; one-NFE GANs are the exception in the table above. The training-compute axis is no longer the only knob, and across families this new axis often has steeper slopes than the old one.
The Evaluation Crisis¤
A quietly serious problem: the metrics we use to evaluate generative models are broken, and increasingly so as the models improve.
FID Doesn't Match Human Perception¤
Stein et al., 2023 — Exposing Flaws of Generative Model Evaluation Metrics ran the largest-ever psychophysics experiment on generative-model outputs and found that no existing automatic metric (FID, IS, CLIP-FID, MSE, …) strongly correlates with human judgement of perceptual realism. The state-of-the-art perceptual realism that diffusion models reach is not reflected in FID rankings. Worse, FID systematically under-rates diffusion vs. older GAN baselines — partly because of over-reliance on InceptionV3 features that diffusion models don't optimise for.
Replacement Metrics in 2024–2026¤
- CMMD (Jayasumana et al., 2024) — CLIP-feature MMD; better human correlation, sample-efficient.
- DINOv2-FID (Stein et al., 2023; REPA, 2024) — replace InceptionV3 with DINOv2 features.
- Human-preference-aligned metrics — OneReward and the new generation of preference-trained reward models (analogous to RLHF reward models for LLMs).
- Open benchmarks — Awesome-Evaluation-of-Visual-Generation tracks the proliferation.
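The "feature MMD" idea behind CMMD can be sketched in a few lines. This is an illustration only: it assumes features have already been extracted, uses random arrays in their place, and picks a Gaussian-kernel bandwidth arbitrarily. It does not reproduce the embedding model, bandwidth, or scaling of the published metric.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(real_feats, gen_feats, bandwidth=4.0):
    """Biased (V-statistic) estimate of squared MMD between two feature sets."""
    k_rr = rbf_kernel(real_feats, real_feats, bandwidth).mean()
    k_gg = rbf_kernel(gen_feats, gen_feats, bandwidth).mean()
    k_rg = rbf_kernel(real_feats, gen_feats, bandwidth).mean()
    return k_rr + k_gg - 2.0 * k_rg

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
close = rng.normal(size=(200, 16))       # same distribution as `real`
shifted = close + 1.0                    # mean-shifted "generator"

print(mmd_squared(real, close), mmd_squared(real, shifted))
```

Unlike a Fréchet distance, this estimator makes no Gaussian assumption about the features, which is one commonly cited reason for preferring MMD-style metrics; the choice of feature extractor still matters just as much.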
The Same Crisis in Language¤
For autoregressive LLMs, perplexity has decoupled from capability. A 70B Llama-3 with similar perplexity to GPT-4-class models can be wildly worse on MMLU-Pro, GPQA Diamond, FrontierMath, or LiveCodeBench. The 2024–2026 evaluation stack has therefore fragmented into specialised benchmarks (see autoregressive-explained.md → Evaluation Metrics) and LMSYS Chatbot Arena's pairwise human preferences.
What This Implies for Research¤
If our metrics are broken, most reported "X beats Y" results need to be recalibrated. Concrete consequences:
- The 2024 VAR paper's "AR beats diffusion" claim is FID-based. With CMMD or human evaluation the comparison shifts.
- Post-training results are now reported in terms of human-preference arena ratings instead of loss — this is a fundamental shift in how empirical claims are validated.
- Density-estimation comparisons (BPD, NLL) are still trustworthy and remain the only honest way to compare classical flows, EBMs, and AR models on tabular / scientific data.
A frontier research direction: principled, model-class-agnostic evaluation. Anything that gets us closer to a metric that (a) correlates with human judgement, (b) doesn't favour models that share the metric's feature extractor, and (c) is sample-efficient is high-impact.
Mechanistic Interpretability of Generative Models¤
A 2024–2026 thread that may end up reshaping the entire field: figuring out what generative models actually compute internally. Once foreign to the diffusion / flow / GAN literature, mechanistic interpretability has rapidly matured.
Key Findings¤
- Universality of circuits (Olsson et al., 2022 — induction heads; Bridging the Black Box, ACM CSUR 2025) — the same algorithmic motifs (induction heads, copy heads, summary heads) appear across LLMs of different sizes and seeds. Same architecture → same circuits.
- Diffusion semantic layout emerges early (Park et al., 2024 — Emergence and Evolution of Interpretable Concepts in Diffusion Models) — the global layout of a generated image is determined within the first reverse-diffusion step, even though the visual content is still pure noise. This is consistent with the Platonic Representation Hypothesis: the model commits to a semantic representation early and refines it perceptually.
- Multi-hop reasoning has visible internal scratchpads: Anthropic's 2025 interpretability work on Claude shows the model forming intermediate latent representations (e.g. "Dallas → Texas" before answering "what is the capital of the state containing Dallas?") that match the structure of an explicit chain-of-thought. This connects directly to the test-time compute scaling story.
- Open Problems in Mechanistic Interpretability (Jan 2025) — a 29-author consensus paper enumerated the field's current open problems: rigorous definitions of "feature", computational complexity barriers, practical methods that underperform simple baselines on safety-relevant tasks.
Why It Matters for Generative Modelling¤
Mechanistic interpretability gives us the first principled way to answer questions that have driven the field for a decade:
- Why do GANs mode-collapse? (Discriminator over-trains on recently-generated examples; circuit-level analysis would make this concrete.)
- Why does CFG work? Are guidance scales injecting attention or shifting energies?
- Why does test-time compute help reasoning models? Is the chain-of-thought constructing circuits at inference time or invoking pre-trained ones?
- What exactly is the Platonic representation and why does scale converge to it?
These are the questions where the next decade of generative-model theory will probably live.
Reasoning as Generation¤
A 2024–2026 reframing with deep implications: reasoning is generation, and the boundary between them is dissolving.
OpenAI's o1, DeepSeek-R1, and their successors, together with the analysis of Snell et al., 2024, suggest that scaling test-time compute (letting an autoregressive model produce very long chains of thought) unlocks capabilities that train-time scaling alone has not reached at comparable cost. AIME 2024 went from GPT-4o's 12 % to o1's 74 % with a single sample, and 93 % when re-ranking 1000 samples.
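The gap between single-sample and re-ranked accuracy has a simple idealised reading. The sketch below assumes independent samples and a perfect verifier, neither of which holds in practice, so `coverage` is an upper bound on what re-ranking can deliver; both function names and the toy scoring function are hypothetical.

```python
def coverage(p_single, n_samples):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples

def best_of_n(candidates, score_fn):
    """Return the candidate that the (assumed) verifier scores highest."""
    return max(candidates, key=score_fn)

for n in (1, 4, 16, 64):
    print(n, round(coverage(0.3, n), 3))

# Toy verifier: prefers answers closer to a known target value of 7.
picked = best_of_n([3, 11, 7, 8], score_fn=lambda a: -abs(a - 7))
print(picked)
```

The curve saturates quickly under independence; real samples from one model are correlated and real verifiers are imperfect, which is why observed gains from re-ranking are smaller than this bound.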
Three takeaways from the cross-family lens:
- Reasoning is a generative process under verification. o1's RL with Verifiable Rewards (RLVR) is mathematically a reward-shaped policy gradient — equivalent to training an EBM whose energy is the negative reward under the ARM ↔ EBM bijection (Wang et al. 2026).
- Diffusion-style trajectories generalise to reasoning. A 50-step DDPM denoising trajectory and an o1 chain-of-thought are both iterative refinement processes that trade compute for quality. A research direction: can a diffusion model "reason" in latent space the way an LLM reasons in token space?
- World models = visual reasoning. The 2026 Visual Generation in the New Era survey identifies the migration from atomic generation to agentic world modelling as the dominant trend. Video diffusion models that can be queried for "what happens if this object moves left?" are doing visual chain-of-thought. See the next section.
World Models: Generation as the Substrate of Cognition¤
The 2026 frontier is world modelling — generative models that don't just produce data but maintain an internal simulation of the world that can be queried, updated, and acted upon.
The Three-Level Taxonomy¤
Agentic World Modeling (2026) defines:
- L1 Predictor — one-step local transitions (most current diffusion video models)
- L2 Simulator — multi-step action-conditioned rollouts (Sora, Genie, Cosmos)
- L3 Evolver — autonomous model revision and self-improvement (the frontier, mostly aspirational)
Sequential Decoupling vs Unified Coupling¤
Visual Generation in the New Era (2026) distinguishes two architectural philosophies:
- Sequential Decoupling — an LMM "planner" emits plan tokens (text), and a video diffusion model "renders" them into pixels. Combines AR's instruction-following with diffusion's per-pixel fidelity.
- Unified Coupling — causal reasoning and generative power are optimised jointly in a single network. PAN (Physical, Agentic, Nested, 2026) is a representative early example, mixing continuous and discrete representations under a hierarchical generative model.
Why This Closes the Loop¤
World models put the six families into their final unified context: a generative model is a world model when its samples are the substrate of an agent's planning loop. That requires:
- A fast generative core (diffusion-distilled / mean-flow / flow-matching DiT — the speed-of-thought axis)
- A causal prior over actions and outcomes (AR — the planning axis)
- A learned scoring function over outcomes (EBM — the evaluation axis)
- A codec compressing the world into a tractable representation (VAE — the perception axis)
- An adversarial post-training step aligning to human preferences (GAN — the alignment axis)
- An invertible representation supporting counterfactuals (Flow — the editing axis)
In other words: a competent world model is likely to reuse ideas from all six families. The deep-dive pages in this directory are therefore not six mutually exclusive choices; they are six design vocabularies that often become components of a single frontier system.
Choosing a Family in 2026: Decision Guide¤
graph TD
Q{{"What's the<br/>primary goal?"}}
Q -->|"Density estimation,<br/>OOD, compression"| L{{"Tractable<br/>likelihood needed?"}}
Q -->|"High-quality generation"| G{{"Modality?"}}
Q -->|"Real-time<br/>(under 50 ms / sample)"| R{{"Pretrained model<br/>available?"}}
Q -->|"Hybrid generative-discriminative<br/>(joint p(x,y))"| H["JEM / EBM<br/>(joint formulation)"]
L -->|"Exact"| LCF["Classical Flow<br/>(NICE, RealNVP, NSF)<br/>or Autoregressive"]
L -->|"Approximate fine"| LCV["VAE (ELBO) or<br/>Diffusion (PF-ODE)"]
G -->|"Image / video<br/>(text-conditioned)"| GIM["Latent Diffusion DiT<br/>(SD3 / FLUX) +<br/>VAE codec"]
G -->|"Text"| GTX["AR Transformer (Llama-3, DeepSeek-V3)<br/>or Diffusion LM (LLaDA, Mercury)"]
G -->|"Audio"| GAU["AR (WaveNet / Mamba)<br/>or Latent Flow Matching"]
G -->|"Molecules / proteins"| GMO["Equivariant Flow Matching<br/>(SE(3)-FM, FlowMM)"]
G -->|"Discrete graphs"| GDG["DFM<br/>(Discrete Flow Matching)"]
R -->|"Yes (e.g., SDXL teacher)"| RD["Adversarial Diffusion<br/>Distillation (ADD, DMD2)<br/>→ 1–4 NFE student"]
R -->|"No (training from scratch)"| RM["Mean Flow / TarFlow<br/>(1 NFE, no distillation)"]
style Q fill:#fff9c4
style L fill:#fff3cd
style G fill:#fff3cd
style R fill:#fff3cd
style H fill:#e8f5e9
style LCF fill:#e8f5e9
style LCV fill:#e8f5e9
style GIM fill:#e8eaf6
style GTX fill:#fff3cd
style GAU fill:#f3e5f5
style GMO fill:#f3e5f5
style GDG fill:#f3e5f5
style RD fill:#ffebee
style RM fill:#f3e5f5
Hybrid Pipelines in 2026¤
Almost every state-of-the-art generative system shipped between 2024 and 2026 is a hybrid combining at least three of the six families. Three production patterns dominate:
Pattern 1: Image / Video Generation (SD3, FLUX, Sora, HunyuanVideo, Wan)¤
graph TD
P["Pixels / video"] --> V["VAE encoder<br/>(perceptual codec)"]
V --> Z["Latent z"]
Z --> FM["Flow-matching DiT<br/>(MMDiT / Hunyuan-DiT)"]
T["Text prompt"] --> CLIP["Text encoder<br/>(CLIP / T5)"]
CLIP --> FM
FM --> Z2["Denoised z'"]
Z2 --> VD["VAE decoder"]
VD --> Out["Generated pixels / video"]
FM -.->|"optional<br/>adversarial distillation"| GD["GAN discriminator<br/>(FLUX schnell, SDXL Turbo)"]
style V fill:#e8f5e9
style FM fill:#e8eaf6
style GD fill:#ffebee
style VD fill:#e8f5e9
This single diagram involves VAE / autoencoder (codec), Flow Matching (training objective), Diffusion (inference is a denoising / transport trajectory), AR or encoder-only text models (prompt representation), and optionally GAN (adversarial distillation).
Pattern 2: Text Generation (GPT-4, Llama-3, DeepSeek-V3, R1)¤
graph TD
Tok["Tokens"] --> AR["Autoregressive Transformer<br/>(MoE + GQA/MLA + RoPE +<br/>RMSNorm + SwiGLU)"]
AR --> Out["Next token"]
AR -.->|"Speculative<br/>decoding"| Draft["Draft model<br/>(MTP head)"]
Draft -.-> AR
AR -.->|"RLHF / DPO / RLVR<br/>(o1 / R1 reasoning)"| RL["Energy-shaped<br/>policy"]
style AR fill:#fff3cd
style RL fill:#e1f5ff
style Draft fill:#f3e5f5
The post-training step is now formally understood as EBM-style energy shaping (Wang et al., 2026).
Pattern 3: Scientific / Multimodal (Protein Design, Materials, Multimodal World Models)¤
graph TD
D["Domain-specific<br/>data (proteins,<br/>crystals, …)"] --> RFM["Riemannian Flow Matching<br/>(SE(3) / Lie-group<br/>equivariant)"]
RFM --> Sample["Generated<br/>molecule / structure"]
Sample -.-> Filter["Energy-based<br/>filter / verifier"]
Filter -.-> Sample
Multimodal["Multimodal AR token<br/>(GPT-4o / Llama-4)"] --> Plan["Plan tokens"]
Plan --> RFM
RFM --> Render["Rendered output"]
style RFM fill:#f3e5f5
style Filter fill:#e1f5ff
style Multimodal fill:#fff3cd
This is the Pattern 1 + EBM filter + AR planner combination that the 2025–2026 Visual-Generation surveys (Visual Generation in the New Era, 2026) identify as the agentic / world-modelling frontier.
Open Theoretical Problems¤
Eight problems where the cross-family view exposes deep gaps the field has not yet closed. These are not engineering wishlists; they are conceptual open questions where progress would shift the field.
1. A principled choice of corruption process¤
Diffusion uses Gaussian noising; discrete-flow uses CTMC; AR uses left-to-right erasure; VAR uses scale-coarsening. We have no theory predicting which corruption is optimal for a given data manifold. A theoretical answer would unify the families on a single axis: "given \(p_{\mathrm{data}}\), the best corruption is …".
2. The exact rate of "diffusion = flow matching"¤
We have strong equivalence and near-equivalence results for diffusion and Gaussian-path flow matching, but theoretical KL bounds (NeurIPS 2024) only tighten asymptotically. The exact non-asymptotic gap as a function of timestep schedule, target parameterisation, network capacity, and data dimension is open.
3. The partition function for high-dimensional images¤
Despite Energy Matching and EqM bypassing it, no one has yet produced a tight estimate of \(\log Z\) for an EBM trained on natural images. AIS scales poorly past CIFAR. This is the single biggest empirical gap holding EBMs back from being directly comparable on likelihood.
4. Why does CFG work?¤
Classifier-free guidance is the most-used technique in production diffusion, but the formal explanation is unsatisfying — it doesn't sample from any tractable distribution and the "effective temperature" of the resulting distribution is not characterised. Karras et al., 2024 made progress; the exact divergence minimised remains open.
5. The relationship between scaling laws and emergence¤
We have Chinchilla curves for pre-training loss and we have (separate) test-time compute curves for reasoning. There is no theory connecting the two — no current framework predicts which capabilities emerge from train-time scaling vs which require test-time compute. This is arguably the single most important open question in 2026 LLM research.
6. Sample efficiency on the data manifold¤
Why does flow matching need orders of magnitude fewer gradient steps than denoising-score-matching diffusion at matching quality? The Mean Flows (Geng et al., 2025) result reaches FID 3.43 in 1 NFE without distillation — a fact the field does not yet have a clean theoretical explanation for.
7. The Platonic Representation Hypothesis: why does it converge?¤
Huh et al., 2024 gives empirical evidence and informal arguments. A formal theory predicting which representations different families converge to (and which don't) would be transformative — and would tell us where novel research is genuinely necessary vs where it's redundant with Platonic-converged structure.
8. Generation–reasoning unification¤
Wang et al., 2026 connects ARMs to EBMs via maximum-entropy RL, but the connection between iterative refinement (diffusion sampling, MCMC, gradient descent on energy) and iterative reasoning (chain-of-thought, search, planning) is still informal. A unified theory would explain why o1 / R1 / o3 work and predict where they will fail.
Frontier Research Directions: Where to Contribute¤
Concrete novel-research seeds that fall directly out of the cross-family view. Each is plausibly tractable for a strong PhD project or a focused industry team in 2026.
A. Under-Explored Hybrids¤
- AR-encoder + diffusion-decoder for molecules/proteins. The protein-design literature has largely ignored AR planners; a model that emits a sequence of molecular tokens (residues, fragments) and renders them with an SE(3)-equivariant flow-matching decoder could combine the strengths of ProtGPT2 with RFdiffusion.
- Energy-as-reward for distillation. ADD and DMD2 use a discriminator. EqM-style energy gradients could replace the discriminator with a stationary energy landscape — providing a single time-invariant target for fast inference.
- Flow matching in token space. Discrete flow matching (Gat et al., 2024) has matured but remains under-applied to general sequence problems. A flow-matching alternative to RLHF / DPO is plausible.
B. Theoretical Gaps to Fill¤
- Optimal corruption for a given manifold. Empirical or theoretical work tying intrinsic-dimension estimates of \(p_{\mathrm{data}}\) to optimal noise schedules.
- A no-free-lunch theorem for generative trilemmas. We've broken every corner of the trilemma individually; can we prove you cannot break all three at once at a given parameter budget?
- Scaling-to-emergence connector. A predictive theory of which capabilities emerge from train-time vs test-time compute.
C. Evaluation¤
- Family-class-agnostic metrics. Metrics that don't rely on InceptionV3 / DINOv2 / CLIP features (so they don't favour models that share the feature extractor) and that correlate with human judgement. The current state of the art is bad — see The Evaluation Crisis.
- Likelihood-free density estimation for OOD detection that doesn't fall victim to the higher-likelihood-on-OOD pathology (Nalisnick et al., 2019). This is unexpectedly hard despite a decade of attention.
D. World Models¤
- L3 Evolvers: autonomous self-improving world models. L1 (predictor) and L2 (simulator) systems already exist in 2026 (see the taxonomy above), although long-horizon consistency remains open; L3 is the least-explored frontier.
- Hybrid AR-planner / diffusion-renderer with shared representation. Right now the planner and renderer have separate internal representations, communicating via tokens. A shared-representation architecture (cf. Platonic Representation Hypothesis) would be much more sample-efficient.
E. Applications Where No Family Yet Dominates¤
- Long-horizon video (10+ seconds with object permanence and physical consistency).
- Multi-agent generative simulations (multiple actors with private goals).
- Generative scientific verification — proposing novel experiments that an automated lab can run, with credit-assignment back through the generative model.
- Generative embodied models that take continuous actuator commands as part of the conditioning.
- Privacy-preserving generative models at scale — current DP-trained generative models trade off sharply with quality, leaving open whether any family can train on copyrighted / private data with rigorous guarantees.
F. Mechanistic Interpretability of Diffusion / Flow¤
The mechanistic-interpretability literature is overwhelmingly LLM-focused. Diffusion / flow models are amenable to the same circuit-discovery and feature-extraction techniques, yet comparatively little work has been done. This is a green-field area: probing what an SD3 MMDiT actually does internally during a denoising trajectory is a tractable PhD-scale project as of 2026.
Common Confusions and Cross-Family Pitfalls¤
Pedagogical traps that catch most practitioners when they first move between families. Each is a known mistake-pattern worth flagging up-front.
"Likelihood is the right metric for sample quality"¤
Reality: likelihood and perceptual quality can move in opposite directions. A flow model can have lower BPD than diffusion on a benchmark and still produce noticeably worse samples. Use the right metric for the right job — see The Evaluation Crisis.
"Diffusion is fundamentally slow because of MCMC"¤
Reality: diffusion sampling is not MCMC. Score-based diffusion integrates the probability-flow ODE or the reverse-time SDE along a time-indexed path; the steps are deterministic (PF-ODE, noise-free DDIM) or stochastic (reverse-time SDE), but in neither case do they target a fixed stationary distribution or rely on detailed balance. The "MCMC" association comes from EBMs (and from optional Langevin corrector steps) and is misapplied to diffusion in many tutorials.
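A toy example makes the ODE view concrete. The sketch assumes a variance-exploding parameterisation with noise level \(\sigma(t) = t\), under which the probability-flow ODE reads \(\mathrm{d}x/\mathrm{d}t = -t\,\nabla_x \log p_t(x)\), and one-dimensional Gaussian data so that the score is known in closed form. The constants and function names are illustrative.

```python
import numpy as np

DATA_STD = 1.0   # toy data distribution: N(0, DATA_STD^2)
T_MAX = 10.0     # largest noise level

def score(x, t):
    """Exact score of p_t = N(0, DATA_STD^2 + t^2) for the toy Gaussian data."""
    return -x / (DATA_STD**2 + t**2)

def pf_ode_sample(x_init, n_steps=1000):
    """Euler integration of dx/dt = -t * score(x, t) from t = T_MAX down to 0."""
    ts = np.linspace(T_MAX, 0.0, n_steps + 1)
    x = x_init.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * (-t_cur * score(x, t_cur))
    return x

rng = np.random.default_rng(0)
x_T = rng.normal(size=5000) * np.sqrt(DATA_STD**2 + T_MAX**2)
samples_a = pf_ode_sample(x_T)
samples_b = pf_ode_sample(x_T)

print(np.array_equal(samples_a, samples_b))   # same start point gives the same end point
print(samples_a.std())                        # close to DATA_STD, up to Euler and sampling error
```

There is no accept/reject step and no stationary distribution here: the only randomness is the initial draw `x_T`, and the loop is ordinary numerical integration.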
"GAN inputs are sampled from the same prior as VAE latents"¤
Reality: superficially yes — both sample \(z \sim \mathcal{N}(0, I)\) — but a VAE encoder gives you a meaningful posterior \(q(z \mid x)\) that lets you encode real data, whereas a GAN has no encoder. ALI / BiGAN add one but at significant cost, and they don't recover VAE-class representation quality.
"Flows have exact likelihood; that makes them better"¤
Reality: classical flows have exact likelihood under the chosen parameterisation, but the constraint that \(D_{\mathrm{model}} = D_{\mathrm{data}}\) wastes capacity on natural images (cf. the manifold lens). On natural images, the flow-matching family — which gives up exact likelihood — is empirically dominant. "Exact likelihood" is a feature only when likelihood is what you actually need (anomaly detection, scientific density estimation).
"Score matching = denoising"¤
Reality: at one particular noise level \(\sigma\), denoising score matching is equivalent to optimal denoising. But score matching across all \(\sigma\) does not reduce to a single denoising task; that's why DDPM trains a noise-conditional network rather than a single denoiser. Several pre-2020 papers conflated the two and arrived at suboptimal training schedules.
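For reference, the single-noise-level equivalence can be written out. With \(x = x_0 + \sigma\epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), and \(D^*(x, \sigma) = \mathbb{E}[x_0 \mid x]\) denoting the MSE-optimal denoiser (notation introduced here for illustration), Tweedie's formula gives:

\[
\nabla_x \log p_\sigma(x) = \frac{\mathbb{E}[x_0 \mid x] - x}{\sigma^2} = \frac{D^*(x, \sigma) - x}{\sigma^2}.
\]

The identity holds separately at each \(\sigma\), so covering a range of noise levels requires a denoiser indexed by \(\sigma\) rather than one fixed denoiser.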
"Flow matching is a different thing from diffusion"¤
Reality: as of 2024, Diffusion Meets Flow Matching shows that Gaussian-path flow matching and DDPM-style diffusion can be written as closely related probability-path algorithms under different parameterisations. They differ in noise schedule, prediction target, stochastic-vs-ODE sampling, and whether the path is Gaussian, OT, rectified, or discrete. SD3 and FLUX.1 are flow-matching / rectified-flow at training time but use diffusion-style ODE solvers at inference.
"GAN's discriminator is just a classifier"¤
Reality: the discriminator is a learned divergence estimator. A vanilla GAN's discriminator approximates the JS divergence; a WGAN critic approximates the Wasserstein distance; an f-GAN's critic approximates an f-divergence. Treating it as "just a binary classifier" misses the deepest insight in the GAN literature.
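The vanilla-GAN case can be checked numerically on a small discrete example. The sketch assumes the standard optimal discriminator \(D^* = p_{\mathrm{data}} / (p_{\mathrm{data}} + p_g)\) and the identity that the GAN value at \(D^*\) equals \(2\,\mathrm{JS} - \log 4\); the two toy distributions and the function names are made up for illustration.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_at_optimal_d(p_data, p_gen):
    """E_data[log D*] + E_gen[log(1 - D*)] with D* = p_data / (p_data + p_gen)."""
    d_star = p_data / (p_data + p_gen)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_gen * np.log(1.0 - d_star)))

p_data = np.array([0.5, 0.3, 0.2])
p_gen = np.array([0.2, 0.2, 0.6])

value = gan_value_at_optimal_d(p_data, p_gen)
via_js = 2.0 * js_divergence(p_data, p_gen) - np.log(4.0)
print(value, via_js)   # the two numbers agree
```

In other words, a perfectly trained discriminator's loss is a divergence measurement in disguise, which is why changing the critic's objective changes which divergence the generator minimises.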
"AR models can't be trained in parallel"¤
Reality: AR model training is fully parallel via teacher forcing, with the entire sequence processed in one forward pass under causal masking. Sampling remains sequential, although multi-token prediction and speculative decoding (2024–2026) narrow much of this gap.
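The causal mask is what makes the single parallel pass valid. The sketch below is a minimal single-head self-attention in NumPy with identity projections (an illustrative simplification, not a full Transformer layer); it shows that all positions are computed at once and that perturbing the last token leaves every earlier output unchanged.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with identity projections and a causal mask."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # True above the diagonal marks future positions that must not be attended to.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))      # toy sequence: 6 positions, 8 features
out = causal_self_attention(tokens)   # all 6 positions in one pass

# Change only the final token and recompute.
perturbed = tokens.copy()
perturbed[-1] += 10.0
out_perturbed = causal_self_attention(perturbed)

prefix_unchanged = bool(np.allclose(out[:-1], out_perturbed[:-1]))
last_changed = not np.allclose(out[-1], out_perturbed[-1])
```

Because position \(t\) never sees positions after \(t\), the ground-truth sequence can be fed in whole during training without leaking the targets.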
"EBMs are obsolete"¤
Reality: pre-2024 this was a reasonable practical view. The "score-based diffusion as time-indexed EBMs" arrow in Modern Unifications shows that score-based diffusion can be read as a time-indexed energy model when the learned field is a true score, that is, the gradient of a scalar log-density. Energy Matching, Equilibrium Matching, and ARMs-as-EBMs restored EBMs to the centre of generative-model research between 2024 and 2026.
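The "time-indexed energy" reading amounts to \(\nabla_x \log p_t(x) = -\nabla_x E_t(x)\). The sketch assumes a toy case chosen for illustration: standard-normal data noised with a variance-exploding scale \(t\), so the marginal is \(\mathcal{N}(0, (1 + t^2) I)\) and the energy is quadratic. It differentiates the energy numerically and compares against the analytic score.

```python
import numpy as np

def energy(x, t):
    """Toy time-indexed energy E_t(x) for the marginal N(0, (1 + t^2) I)."""
    return 0.5 * np.sum(x**2) / (1.0 + t**2)

def score_from_energy(x, t, step=1e-5):
    """Score as the negative energy gradient, via central finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        offset = np.zeros_like(x)
        offset[i] = step
        grad[i] = (energy(x + offset, t) - energy(x - offset, t)) / (2.0 * step)
    return -grad

x = np.array([0.5, -1.0, 2.0])
t = 0.7
analytic_score = -x / (1.0 + t**2)
numeric_score = score_from_energy(x, t)
score_error = float(np.max(np.abs(numeric_score - analytic_score)))
```

A learned vector field only admits such an energy if it is conservative, which is the "true score" caveat in the paragraph above.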
"FID is the standard metric for image generation"¤
Reality: FID is historically the standard but has been shown not to correlate strongly with human judgement, especially for diffusion-class models (cf. Stein et al., 2023). Use CMMD, DINOv2-FID, or human-preference scores for 2026 evaluations.
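For reference, the quantity behind FID is the Fréchet distance between two Gaussians fitted to feature statistics, \(\lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)\). The sketch below computes it with NumPy on synthetic features; real FID uses Inception-v3 activations, and the random features here are a stand-in assumed purely for illustration.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between N(mu1, cov1) and N(mu2, cov2)."""
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    # Tr((cov1 cov2)^{1/2}) equals the sum of square roots of the eigenvalues of cov1 @ cov2.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))
    return mean_term + float(np.trace(cov1) + np.trace(cov2)) - 2.0 * sqrt_trace

def gaussian_stats(features):
    """Mean and covariance of a (num_samples, feature_dim) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 4))            # synthetic "real" features
fake = rng.normal(loc=0.5, size=(5000, 4))   # synthetic "generated" features, mean-shifted

fd_same = frechet_distance(*gaussian_stats(real), *gaussian_stats(real))
fd_shifted = frechet_distance(*gaussian_stats(real), *gaussian_stats(fake))
```

The formula only sees the first two moments of the features, so its behaviour depends heavily on the feature extractor; swapping Inception features for DINOv2 features keeps the same computation.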
Glossary of Common Notation¤
| Symbol | Used in | Meaning |
|---|---|---|
| \(x \sim p_{\mathrm{data}}\) | all | A sample from the true (unknown) data distribution |
| \(p_\theta(x)\) | all | Model distribution with parameters \(\theta\) |
| \(z\) | VAE, GAN, flows | Latent variable; usually \(z \sim \mathcal{N}(0, I)\) |
| \(x_t\) | diffusion, flow | Noisy / interpolated sample at time \(t \in [0, T]\) or \([0, 1]\) |
| \(\epsilon\) | diffusion | Standard-normal noise added to \(x_0\) to obtain \(x_t\) |
| \(E_\theta(x)\) | EBM, JEM | Scalar energy function; \(p_\theta(x) \propto e^{-E_\theta(x)}\) |
| \(Z\) or \(Z_\theta\) | EBM | Partition function \(\int e^{-E_\theta(x)} dx\) |
| \(\nabla_x \log p_t(x)\) | diffusion, EBM | Score function (gradient of log-density in \(x\)) |
| \(v_\theta(x, t)\) | flow matching | Vector field along the interpolant |
| \(\alpha_t, \bar\alpha_t, \beta_t\) | diffusion | Variance-schedule coefficients (DDPM convention) |
| \(\sigma_t\) | EDM diffusion | Continuous-time noise scale (variance-exploding convention) |
| \(D(\cdot), G(\cdot)\) | GAN | Discriminator / Generator networks |
| \(f_\theta, f^{-1}_\theta\) | flows | Forward / inverse invertible transformation |
| \(T\) | AR, diffusion | Sequence length (AR) or number of diffusion steps |
| \(\mathrm{ELBO}\) | VAE, diffusion | Evidence Lower Bound — tractable lower bound on \(\log p(x)\) |
| \(\mathrm{NLL}\) | all | Negative log-likelihood |
| \(\mathrm{FID}\) | all (eval) | Fréchet Inception Distance — sample-quality metric |
| \(\mathrm{BPD}\) | flows, AR | Bits per dimension — likelihood normalised by data dimension |
| \(\mathrm{NFE}\) | diffusion, flow (inference) | Number of function evaluations: network forward passes needed per sample |
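As a worked example for the BPD row, the conversion from a negative log-likelihood in nats to bits per dimension is a division by \(D \ln 2\). The sketch uses a hypothetical helper name and, as a sanity check, a uniform model over 256 intensity levels on a \(3 \times 32 \times 32\) image, which should cost 8 bits per dimension.

```python
import math

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

num_dims = 3 * 32 * 32   # e.g. a 32x32 RGB image

# A uniform model over 256 levels assigns log(256) nats to every dimension.
uniform_nll = num_dims * math.log(256.0)
bpd_uniform = nats_to_bits_per_dim(uniform_nll, num_dims)
```

Reported BPD numbers are only comparable when the models share the same data dimension and discretisation.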
Where the Field Is Going (2026)¤
Five trends are reshaping the practical and theoretical landscape:
- Test-time compute as a first-class scaling axis: o1, o3, and DeepSeek-R1 have shown that spending inference compute on hard problems is often more cost-effective than scaling parameters. This applies across families: distilled diffusion at 1–4 NFEs sits at the low end of the inference-compute axis, and reasoning LLMs' chain-of-thought sits at the high end. Expect more systems to expose a tunable quality / latency / reasoning knob. → See autoregressive-explained.md.
- Hybrid architectures everywhere — pure-Transformer LLMs are increasingly paired with Mamba/attention hybrids, MoE routing, retrieval, verification, and tool-use systems; visual generators are paired with codecs, flow / diffusion objectives, and sometimes adversarial distillation; text models gain diffusion-style parallel decoding (LLaDA, Mercury). The 2026 default model is a stack, not a single family.
- Mathematical unification continues — Energy Matching, Equilibrium Matching, ARMs-as-EBMs, and Generator Matching (Holderrieth et al., 2024) are all 2024–2026 results showing that the historical "six families" share more structure than their names imply. Expect more unifying objectives, but treat "master objective" claims cautiously until they preserve likelihoods, samples, conditioning, and finite-compute behaviour at the same time.
- Multi-component foundation systems — the 2026 production AI system is no longer a monolithic LLM but a system of generation + verification + safety + memory + retrieval components. The boundaries between "generative model" and "agentic system" are blurring. → See Hybrid Pipelines.
- Representations are converging — the Platonic Representation Hypothesis suggests that scale is driving disparate models toward a shared internal representation. Whether this is true universally is the single most consequential empirical question of the next 5 years; if it is, the optimal research strategy is to inherit foundation representations and only train what's differentiated.
Further Reading: Cross-Family Surveys and Frontier Work¤
| Resource | Year | Scope |
|---|---|---|
| Understanding Diffusion Models: A Unified Perspective (Calvin Luo) | 2022 | Connects ELBO, score matching, and SDEs in a single derivation |
| Hitchhiker's Guide on the Relation of EBMs with Other Generative Models | 2024–2025 | EBMs as the bedrock; comprehensive cross-family review |
| Deep Generative Modelling: A Comparative Review (Bond-Taylor et al.) | 2021 | Foundational comparative survey of VAEs, GANs, flows, EBMs, AR |
| Diffusion Meets Flow Matching | 2024 | Online resource showing that diffusion and Gaussian-path flow matching coincide up to parameterisation |
| Generator Matching (Holderrieth et al.) | 2024 | Single objective subsuming flow / score / discrete-flow training |
| The Platonic Representation Hypothesis (Huh et al.) | 2024 | Why model representations are converging |
| Visual Generation in the New Era | 2026 | Atomic mapping → conditional → in-context → agentic → world modelling |
| Agentic World Modeling: Foundations, Capabilities, Laws | 2026 | L1 / L2 / L3 taxonomy of world models |
| State of LLMs 2025 (Raschka) | 2025 | Year-end retrospective with 2026 predictions |
| Generative AI in 2026: 7 Research Breakthroughs | 2026 | Forward-looking predictions |
| Survey on the Role of Mechanistic Interpretability in Generative AI | 2025 | Cross-family interpretability |
| Bridging the Black Box: A Survey on Mechanistic Interpretability (ACM CSUR) | 2025 | Universality of circuits, scaling laws of interpretability |
| Open Problems in Mechanistic Interpretability (Jan 2025) | 2025 | 29-author consensus on open problems |
For family-specific surveys, see the Surveys subsection at the end of each deep-dive.
Next Steps¤
-
Start with Autoregressive Models
The simplest factorisation; cleanest example of tractable likelihood
-
Or jump to Diffusion Models
The largest deep-dive; SOTA on images, video, audio, and 3-D
-
Practical model guides
Hands-on guides for VAE, GAN, Diffusion, Flow, EBM, and Autoregressive in Artifex
-
Frontier research seeds
See the Open Theoretical Problems and Frontier Research Directions sections above for concrete novel-research opportunities