Normalizing Flows Explained¤

Exact Likelihood

Compute exact log-likelihood through tractable Jacobian determinants, enabling precise density estimation
Bijective Transformations

Invertible mappings allow both efficient sampling and exact inference through forward and inverse passes
Flexible Distributions

Transform simple base distributions into complex target distributions through learned compositions
Fast Generation

Single-pass or few-step sampling with modern architectures achieving real-time performance

New here?

For a one-page map of how flows fit next to VAEs, GANs, Diffusion, EBMs, and Autoregressive models, start with Generative Models — A Unified View. This page is the deep-dive on normalizing flows and flow matching specifically.

Overview¤

Normalizing flows, introduced by Rezende & Mohamed (2015) and Dinh et al. (2014 — NICE), are a class of generative models that provide exact likelihood computation and efficient sampling through invertible transformations. Unlike VAEs that optimize approximate lower bounds or GANs that learn implicit distributions, flows transform simple base distributions into complex data distributions via learned bijective mappings with tractable Jacobian determinants. The 2022–2026 wave of flow matching (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023) has unified flows with diffusion models and put them at the core of every modern large-scale image and video generator (Stable Diffusion 3, FLUX.1, HunyuanVideo, Wan 2.x).

What makes normalizing flows special? Flows solve a fundamental challenge in generative modeling: simultaneously enabling precise density estimation and efficient sampling. By learning invertible transformations with structured Jacobians, classical flows (NICE, RealNVP, Glow, NSF):

Compute exact likelihood for any data point without approximation
Generate samples through fast inverse transformations
Perform exact inference without variational bounds or adversarial training
Train stably using straightforward maximum likelihood objectives

Two flavours of flow

The 2022–2026 flow-matching generation (Lipman, Liu, Albergo, …) trades the exact-likelihood / structured-Jacobian requirement for architectural freedom — flow-matching networks are unconstrained DiTs / U-Nets trained by L2 regression on a vector field. They retain tractable likelihood through the probability-flow ODE, but in practice are used purely as fast generators. This document covers both: the discrete-step classical-flow architectures (Sections "Flow Model Architectures" → "Training Normalizing Flows") and the continuous-time / flow-matching family (Sections "Continuous-Time Flows" → "Recent Advances").

Breakthroughs from 2023 through 2026 — flow matching, rectified flow, mean flows, discrete flow matching, and Riemannian / equivariant variants — have closed the performance gap with diffusion: rectified-flow transformers now power SD3 and FLUX.1, mean flows match diffusion FID at one NFE (Geng et al., NeurIPS 2025), and discrete flow matching matches autoregressive language models on math and reasoning benchmarks. NeurIPS 2025 alone accepted 30+ flow-matching papers and ICLR 2026 received 150+ flow-matching submissions — see Recent Advances (2024–2026) below.

The Intuition: Probability Transformations¤

Think of normalizing flows like a sequence of coordinate transformations on a map:

Start with simple terrain (base distribution) - a flat, uniform grid easy to sample from
Apply transformations - each step warps, stretches, and reshapes the terrain while maintaining a perfect one-to-one correspondence between original and transformed coordinates
Track volume changes - the Jacobian determinant measures how much each region expands or contracts, ensuring probability mass is conserved
Compose transformations - stack multiple simple warps to create arbitrarily complex landscapes (data distributions)

The critical insight: by carefully designing transformations where we can efficiently compute both the forward mapping and the volume change, we get a model that can both generate samples (apply the transformation) and evaluate probabilities (apply the inverse and account for volume changes).

Mathematical Foundation¤

The Change of Variables Formula¤

The change of variables formula serves as the cornerstone of all normalizing flow architectures. Given a random variable $\mathbf{z}$ with known density $p_\mathcal{Z}(\mathbf{z})$ and an invertible transformation $\mathbf{x} = f(\mathbf{z})$, the density of $\mathbf{x}$ becomes:

\[ p_\mathcal{X}(\mathbf{x}) = p_\mathcal{Z}(f^{-1}(\mathbf{x})) \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right| \]

Or equivalently in log space:

\[ \log p_\mathcal{X}(\mathbf{x}) = \log p_\mathcal{Z}(\mathbf{z}) + \log \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right| \]

where $\mathbf{z} = f^{-1}(\mathbf{x})$.

Geometric Intuition

The Jacobian determinant $\left| \det \frac{\partial f}{\partial \mathbf{z}} \right|$ quantifies the relative change in volume of an infinitesimal neighborhood under transformation $f$. When the transformation expands a region ($|\det J| > 1$), the probability density must decrease proportionally to conserve total probability mass. Conversely, contraction ($|\det J| < 1$) concentrates probability, increasing density.

For $D$-dimensional vectors, the Jacobian matrix $J_f(\mathbf{z})$ is the $D \times D$ matrix of partial derivatives $[\frac{\partial f_i}{\partial z_j}]$. Computing a general determinant requires $O(D^3)$ operations, which becomes intractable for high-dimensional data like 256×256 RGB images with $D = 196{,}608$ dimensions.

The entire field of normalizing flows revolves around designing transformations with structured Jacobians—triangular, diagonal, or block-structured matrices where determinants reduce to $O(D)$ computations.

Composing Multiple Transformations¤

A single invertible transformation typically provides limited modeling capacity. The power of flows emerges through composition: stacking $K$ transformations:

\[ \mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}) \]

The log-likelihood decomposes additively:

\[ \log p_\mathcal{X}(\mathbf{x}) = \log p_\mathcal{Z}(\mathbf{z}_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right| \]

where $\mathbf{z}_0 = \mathbf{z}$ and $\mathbf{z}_k = f_k(\mathbf{z}_{k-1})$ for $k=1,\ldots,K$.

graph TD
    Z0["z₀<br/>(Base)"] --> F1["f₁"]
    F1 --> Z1["z₁"]
    Z1 --> F2["f₂"]
    F2 --> Z2["z₂"]
    Z2 --> Dots["..."]
    Dots --> FK["f_K"]
    FK --> X["x<br/>(Data)"]

    F1 -.->|"log|det J₁|"| LogDet1["Σ log-det"]
    F2 -.->|"log|det J₂|"| LogDet1
    FK -.->|"log|det J_K|"| LogDet1

    style Z0 fill:#e1f5ff
    style X fill:#ffe1e1
    style LogDet1 fill:#fff3cd

Additive Structure in Log-Space

The chain rule for Jacobians states $\det J_{f_2 \circ f_1}(\mathbf{u}) = \det J_{f_2}(f_1(\mathbf{u})) \cdot \det J_{f_1}(\mathbf{u})$, so log-determinants simply add: $\log|\det J_\text{total}| = \sum_k \log|\det J_k|$. This ensures numerical stability and makes total computational cost $O(KD)$ when each layer has $O(D)$ Jacobian computation.

Three Requirements for Flow Layers¤

For a transformation $f$ to be a valid flow layer, it must satisfy:

Invertibility: $f$ must be bijective (one-to-one and onto)
Efficient Jacobian: $\log \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|$ must be tractable to compute
Efficient Inverse: $f^{-1}$ must be computable efficiently (for sampling)

Different flow architectures make different trade-offs among these requirements.

Base Distribution¤

The base distribution $p_\mathcal{Z}(\mathbf{z})$ is typically chosen to be simple for efficient sampling:

Standard Gaussian (most common):

\[ p_\mathcal{Z}(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}) = \frac{1}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2}\|\mathbf{z}\|^2\right) \]

Uniform (less common):

\[ p_\mathcal{Z}(\mathbf{z}) = \mathcal{U}(\mathbf{z}; -a, a) = \frac{1}{(2a)^D} \mathbb{1}_{[-a,a]^D}(\mathbf{z}) \]

Flow Model Architectures¤

Artifex provides implementations of several flow architectures, each with different trade-offs between expressiveness, computational efficiency, and ease of use.

1. NICE: Pioneering Coupling Layers¤

NICE (Dinh, Krueger & Bengio, 2014) — Non-linear Independent Components Estimation — introduced additive coupling layers that made normalizing flows practical for high-dimensional data.

Coupling Layer Mechanism:

Given input $\mathbf{x} \in \mathbb{R}^D$, partition into $(\mathbf{x}_1, \mathbf{x}_2)$:

\[ \begin{align} \mathbf{y}_1 &= \mathbf{x}_1 \\ \mathbf{y}_2 &= \mathbf{x}_2 + m(\mathbf{x}_1) \end{align} \]

where $m$ can be any arbitrary function (typically a neural network).

Key Properties:

Volume-preserving: $\log|\det(\mathbf{J})| = 0$ (determinant is exactly 1)
Efficient inverse: $\mathbf{x}_1 = \mathbf{y}_1$, $\mathbf{x}_2 = \mathbf{y}_2 - m(\mathbf{y}_1)$
No Jacobian computation: The triangular structure makes the determinant trivial
Arbitrary coupling function: $m$ can be arbitrarily complex without affecting computational cost

graph TB
    X["Input x"] --> Split["Partition<br/>(x₁, x₂)"]
    Split --> X1["x₁<br/>(unchanged)"]
    Split --> X2["x₂"]

    X1 --> NN["Neural Network<br/>m(x₁)"]
    NN --> Add["y₂ = x₂ + m(x₁)"]
    X2 --> Add

    X1 --> Concat["Concatenate"]
    Add --> Concat
    Concat --> Y["Output y"]

    style X fill:#e1f5ff
    style Y fill:#ffe1e1
    style NN fill:#fff3cd

When to Use NICE:

Fast forward and inverse computations required
Volume-preserving transformations are acceptable
Starting point for understanding coupling layers
Lower-dimensional problems (hundreds of dimensions)

2. RealNVP: Adding Scale for Expressiveness¤

RealNVP (Dinh, Sohl-Dickstein & Bengio, 2017 — ICLR) — Real-valued Non-Volume Preserving — extends NICE with affine coupling layers:

\[ \begin{align} \mathbf{y}_1 &= \mathbf{x}_1 \\ \mathbf{y}_2 &= \mathbf{x}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1) \end{align} \]

where $s(\cdot)$ and $t(\cdot)$ are neural networks outputting scale and translation, and $\odot$ denotes element-wise multiplication.

Key Properties:

Tractable Jacobian: $\log|\det(\mathbf{J})| = \sum_i s_i(\mathbf{x}_1)$
Efficient inverse: $$ \begin{align} \mathbf{x}_1 &= \mathbf{y}_1 \ \mathbf{x}_2 &= (\mathbf{y}_2 - t(\mathbf{y}_1)) \odot \exp(-s(\mathbf{y}_1)) \end{align} $$
Alternating masks: Alternate which dimensions are transformed across layers
No gradient through scale/translation: $s$ and $t$ can be arbitrarily complex ResNets

graph TB
    X["Input x"] --> Split["Split<br/>(x₁, x₂)"]
    Split --> X1["x₁<br/>(unchanged)"]
    Split --> X2["x₂"]

    X1 --> NN["Neural Networks<br/>s(x₁), t(x₁)"]
    NN --> Scale["exp(s)"]
    NN --> Trans["t"]

    X2 --> Mult["⊙"]
    Scale --> Mult
    Mult --> Add["+ t"]
    Trans --> Add
    Add --> Y2["y₂"]

    X1 --> Concat["Concatenate"]
    Y2 --> Concat
    Concat --> Y["Output y"]

    style X fill:#e1f5ff
    style Y fill:#ffe1e1
    style NN fill:#fff3cd

Multi-Scale Architecture:

RealNVP introduced hierarchical structure that revolutionized flow-based modeling:

Squeeze operation: Reshape $s \times s \times c$ tensors into $\frac{s}{2} \times \frac{s}{2} \times 4c$
Factor out: After several coupling layers, factor out half the channels to the prior
Continue processing: Transform remaining channels at higher resolution

This enables modeling 256×256 images by avoiding the prohibitive cost of applying dozens of layers to all 196,608 dimensions simultaneously.

When to Use RealNVP:

Need both fast sampling and density estimation
Working with continuous data, especially images
Image generation tasks at moderate to high resolution
Moderate-dimensional data (hundreds to thousands of dimensions)

3. Glow: Learnable Permutations¤

Glow (Kingma & Dhariwal, 2018 — NeurIPS) extends RealNVP with three key block-level innovations. Artifex retains these Glow blocks as a single-scale image flow baseline rather than the paper's squeeze/split multi-scale hierarchy.

Glow Block Architecture:

Each flow step combines three layers:

graph TB
    X["Input"] --> AN["ActNorm<br/>(Activation Normalization)"]
    AN --> Conv["Invertible 1×1 Conv<br/>(Channel Mixing)"]
    Conv --> Coup["Affine Coupling Layer<br/>(Transformation)"]
    Coup --> Y["Output"]

    style X fill:#e1f5ff
    style Y fill:#ffe1e1
    style AN fill:#d4edda
    style Conv fill:#d1ecf1
    style Coup fill:#fff3cd

1. ActNorm (Activation Normalization):

Per-channel affine transformation with trainable parameters:

\[ y_{i,j,c} = s_c \cdot x_{i,j,c} + b_c \]

Data-dependent initialization: normalize first minibatch to zero mean, unit variance
Enables training with batch size 1 (critical for high-resolution images)
$\log|\det J| = H \cdot W \cdot \sum_c \log|s_c|$ for $H \times W$ spatial dimensions

2. Invertible 1×1 Convolution:

Learned linear mixing of channels using invertible matrix $\mathbf{W}$:

\[ \mathbf{y} = \mathbf{W} \mathbf{x} \]

Replaces fixed permutations with learned channel mixing
LU decomposition: $\mathbf{W} = \mathbf{P} \cdot \mathbf{L} \cdot (\mathbf{U} + \text{diag}(\mathbf{s}))$
Determinant: $\log|\det \mathbf{W}| = \sum_i \log|s_i|$ (reduced to $O(c)$)
Improved log-likelihood by ~0.5 bits/dimension over fixed permutations

3. Affine Coupling Layer:

Similar to RealNVP but with the above improvements.

When to Use Glow in Artifex:

Need a fixed-size image flow with learnable channel mixing
Want exact likelihood with image-shaped forward and inverse transforms
Need ActNorm plus invertible 1×1 convolutions as a stronger image baseline
Want a deeper single-scale alternative to RealNVP on image tensors

Artifex Runtime Note

The retained implementation is configured through image_shape and blocks_per_scale. If you need the original paper's multi-scale squeeze/split hierarchy, you will need a different runtime.

4. MAF: Masked Autoregressive Flow¤

MAF (Papamakarios, Pavlakou & Murray, 2017 — NeurIPS) uses autoregressive transformations where each dimension depends on all previous dimensions, providing maximum expressiveness at the cost of sequential sampling.

Autoregressive Transformation:

\[ z_i = (x_i - \mu_i(x_{<i})) \cdot \exp(-\alpha_i(x_{<i})) \]

where $\mu_i$ and $\alpha_i$ are computed by a MADE (Masked Autoencoder for Distribution Estimation) network.

MADE Architecture (Germain et al., 2015 — ICML):

Uses masked connections to ensure autoregressive property—each output depends only on previous inputs:

graph TB
    X1["x₁"] --> H1["h₁"]
    X2["x₂"] --> H1
    X2 --> H2["h₂"]
    X3["x₃"] --> H2
    X3 --> H3["h₃"]

    H1 --> Z1["μ₁, α₁"]
    H1 --> Z2["μ₂, α₂"]
    H2 --> Z2
    H2 --> Z3["μ₃, α₃"]
    H3 --> Z3

    style X1 fill:#e1f5ff
    style X2 fill:#e1f5ff
    style X3 fill:#e1f5ff
    style Z1 fill:#ffe1e1
    style Z2 fill:#ffe1e1
    style Z3 fill:#ffe1e1

Trade-offs:

Direction	Complexity	Use Case
Forward (density)	$O(1)$ passes	All dimensions computed in parallel
Inverse (sampling)	$O(D)$ passes	Sequential computation required

When to Use MAF:

Density estimation is the primary goal
Sampling speed is less critical
Tabular or low-to-moderate dimensional data
Need highly expressive transformations
All dimensions should interact

5. IAF: Inverse Autoregressive Flow¤

IAF (Kingma et al., 2016 — NeurIPS) is the "inverse" of MAF with opposite computational trade-offs:

\[ y_i = x_i \cdot \exp(\alpha_i(y_{<i})) + \mu_i(y_{<i}) \]

Trade-offs:

Direction	Complexity	Use Case
Forward (density)	$O(D)$ passes	Sequential computation
Inverse (sampling)	$O(1)$ passes	All dimensions computed in parallel

When to Use IAF:

Fast sampling is the primary goal
Density estimation is secondary or not needed
Variational inference (amortized inference in VAEs)
Real-time generation applications

6. Neural Spline Flows¤

Neural Spline Flows (Durkan, Bekasov, Murray & Papamakarios, 2019 — NeurIPS) use monotonic rational-quadratic splines to create highly expressive yet tractable transformations.

Rational-Quadratic Spline Transform:

Each spline maps interval $[-B, B]$ to itself using $K$ rational-quadratic segments, parameterized by:

$K+1$ knot positions $\{(x^{(k)}, y^{(k)})\}$
$K+1$ derivative values $\{\delta^{(k)}\}$

Within segment $k$, the transformation applies a ratio of quadratic polynomials.

Key Properties:

Strict monotonicity: Ensures invertibility
Smooth derivatives: No discontinuities (unlike piecewise-linear)
Closed-form operations: Forward evaluation, analytic inverse (quadratic equation), closed-form derivative
Universal approximation: With sufficient bins (8-16 typically suffice)

graph LR
    Input["x"] --> Spline["Monotonic<br/>Rational-Quadratic<br/>Spline"]
    Spline --> Output["y"]
    Params["Knot positions<br/>Derivatives"] -.-> Spline

    style Input fill:#e1f5ff
    style Output fill:#ffe1e1
    style Params fill:#fff3cd

Compared to alternatives:

vs Affine: ~23 parameters per dimension vs 2, much more expressive
vs Neural Autoregressive Flows: No iterative root-finding needed
vs Flow++: No bisection algorithms required
vs Piecewise-linear: Smooth derivatives improve optimization

Results:

CIFAR-10: 3.38 bits/dimension using 10× fewer parameters than Glow
Best-in-class likelihood on multiple density estimation benchmarks

When to Use Neural Spline Flows:

Maximum expressiveness with tractability
Density estimation on complex distributions
Want fewer parameters than Glow
Need smooth, differentiable transformations

Training Normalizing Flows¤

Maximum Likelihood Objective¤

Flow training optimizes the straightforward objective of maximum likelihood:

\[ \max_\theta \mathbb{E}_{x \sim p_\text{data}}[\log p_\theta(x)] \]

Equivalently, minimize negative log-likelihood:

\[ \mathcal{L}(\theta) = -\mathbb{E}_{x \sim p_\text{data}}\left[\log p_u(f^{-1}_\theta(x)) + \log \left| \det \frac{\partial f^{-1}_\theta}{\partial x} \right|\right] \]

This simplicity contrasts sharply with:

GANs: Adversarial minimax optimization with mode collapse risks
VAEs: ELBO with reconstruction-regularization trade-off
Diffusion: Multi-step denoising with noise schedule design

Training Stability

Gradients flow through the entire composition automatically via backpropagation. Standard optimizers like Adam with learning rates around $10^{-3}$ work reliably. The monotonic improvement of likelihood makes training highly stable.

Critical Preprocessing Steps¤

Proper preprocessing proves essential for successful training:

1. Dequantization:

Discrete data (e.g., uint8 images) create delta peaks in continuous space, allowing flows to assign arbitrarily high likelihood to exact discrete values while ignoring intermediate regions.

# Uniform dequantization
rng, noise_key = jax.random.split(rng)
x_dequantized = x + jax.random.uniform(noise_key, x.shape, dtype=x.dtype) / 256.0

# Variational dequantization (more sophisticated)
noise = flow_model_for_noise(x)
x_dequantized = x + noise / 256.0

2. Logit Transform:

Maps bounded $[0,1]$ data to unbounded $(-\infty, +\infty)$ space matching Gaussian priors:

# Add small constant for numerical stability
alpha = 0.05
x = alpha + (1 - 2*alpha) * x

# Apply logit transform
x_logit = jnp.log(x) - jnp.log1p(-x)

Critical Importance

Without these preprocessing steps, training diverges immediately as the model tries to match bounded data to Gaussian base distributions.

Numerical Stability Techniques¤

1. Log-Space Computation:

Never compute $\det(J)$ directly—immediate overflow/underflow:

# WRONG
det_J = jnp.linalg.det(jacobian)
log_det = jnp.log(det_J)  # Overflow!

# CORRECT
sign, log_det = jnp.linalg.slogdet(jacobian)
# Or for triangular Jacobians:
log_det = jnp.sum(jnp.log(jnp.abs(jnp.diagonal(jacobian))))

2. Gradient Clipping:

Prevents exploding gradients in deep architectures (10-15+ layers):

tx = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=1e-3),
)

3. Normalization Layers:

Batch normalization or ActNorm stabilizes intermediate representations:

# ActNorm: data-dependent initialization
scale, bias = compute_initial_stats(first_batch)
# Then treat as learnable parameters

4. Learning Rate Schedules:

Polynomial decay with warmup improves convergence:

# Warmup: 2-10 epochs linear increase
# Main training: polynomial decay from 1e-3 to 1e-4
scheduler = PolynomialDecay(optimizer, warmup_steps=1000,
                           total_steps=100000)

Monitoring Training¤

Watch these metrics:

Negative log-likelihood: Should decrease steadily
Per-layer log-determinants: Monitor for sudden spikes (numerical issues)
Reconstruction error: $\|x - f(f^{-1}(x))\| < 10^{-5}$ for numerical stability
Bits per dimension: For images, $\text{bpd} = \frac{\text{NLL}}{D \cdot \log 2}$

# Invertibility check
x_reconstructed = flow.inverse(flow.forward(x))
recon_error = jnp.mean(jnp.abs(x - x_reconstructed))
assert recon_error < 1e-5, f"Poor invertibility: {recon_error}"

Common Pitfalls and Solutions¤

Missing Preprocessing

Symptom: Immediate divergence or NaN losses

Solution: Always dequantize discrete data and apply logit transform
Numerical Instability

Symptom: Sudden spikes in log-determinants or NaN gradients

Solution: Use log-space computation, gradient clipping, monitor per-layer statistics
Poor Invertibility

Symptom: $\|x - f^{-1}(f(x))\| > 10^{-3}$

Solution: Use residual flows with soft-thresholding, reduce depth, check numerical precision
Slow Convergence

Symptom: Likelihood plateaus early

Solution: Increase model capacity, add more layers, use spline flows, check preprocessing

Continuous-Time Flows and Foundational Flow Matching¤

The architectures above (NICE through Neural Spline) are discrete-step flows: a finite stack of invertible layers, each with a tractable structured Jacobian. The continuous-time variants below replace the layer stack with an ODE / vector-field formulation. This is the lineage that produced flow matching, rectified flow, and ultimately the rectified-flow MMDiT in Stable Diffusion 3 and FLUX.1. The recent 2024–2026 specialisations (Mean Flows, OT-CFM, CFG-Zero*, Stable Velocity, etc.) are organised in their own Recent Advances section below.

Continuous Normalizing Flows (Neural ODEs)¤

Continuous flows (Chen et al., 2018 — NeurIPS Best Paper; Grathwohl et al., 2019 — FFJORD) parameterize the derivative of the hidden state:

\[ \frac{d\mathbf{z}_t}{dt} = f(\mathbf{z}_t, t, \theta) \]

The output $\mathbf{z}_1$ at time $t=1$ given initial condition $\mathbf{z}_0$ at $t=0$ is computed using ODE solvers.

Key Innovation (FFJORD):

The change in log-density follows:

\[ \frac{d \log p(\mathbf{z}_t)}{dt} = -\text{Tr}\left(\frac{\partial f}{\partial \mathbf{z}_t}\right) \]

Hutchinson's trace estimator makes this tractable:

\[ \text{Tr}(A) \approx \mathbf{v}^T A \mathbf{v} \quad \text{where } \mathbf{v} \sim \mathcal{N}(0, I) \]

This unbiased stochastic estimate requires only one Jacobian-vector product per sample, reducing complexity from $O(D^2)$ to $O(D)$.

Advantages:

No architectural constraints: Any neural network architecture works
Flexible expressiveness: Can model disconnected regions and sharp boundaries
Adjoint method: Memory-efficient training ($O(1)$ memory vs $O(\text{depth})$)

Challenges:

Unpredictable cost: Number of function evaluations adapts to complexity
Stiff dynamics: Can struggle with certain distributions
Slower than discrete flows: Requires ODE integration

When to Use:

Modeling distributions with disconnected regions
Physics simulation, molecular dynamics
Scientific domains requiring flexible unrestricted networks

Residual Flows: Invertible ResNets¤

Residual flows (Behrmann et al., 2019 — i-ResNet, ICML; Chen et al., 2019 — Residual Flows, NeurIPS) make standard ResNet architectures $F(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$ invertible by constraining $g$ to be contractive with Lipschitz constant $L < 1$.

Invertibility via Fixed-Point Iteration:

The Banach fixed-point theorem guarantees bijection with inverse computable via:

\[ \mathbf{x}_{k+1} = \mathbf{y} - g(\mathbf{x}_k) \]

which converges exponentially to the true inverse.

Spectral Normalization:

Enforces the constraint by normalizing weight matrices:

\[ \mathbf{W} \leftarrow \frac{\mathbf{W}}{\|\mathbf{W}\|_2 / c} \quad \text{where } c < 1 \]

The spectral norm $\|\mathbf{W}\|_2$ is estimated via power iteration.

Russian Roulette Estimator:

For $F(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$, the log-determinant has power series:

\[ \log|\det(I + J_g)| = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \text{Tr}(J_g^k) \]

Rather than truncating (introducing bias), Russian roulette randomly terminates the series with probability ensuring unbiasedness while maintaining finite computation.

When to Use:

High-dimensional problems (>1000 dimensions)
Want free-form Jacobians (all dimensions interact)
Need competitive density estimation with flexibility
Extensions like Invertible DenseNets for parameter efficiency

Flow Matching: Simulation-Free Training¤

Flow Matching (Lipman, Chen, Ben-Hamu, Nickel & Le, 2023 — ICLR) introduced a paradigm shift for training continuous normalizing flows without ODE simulation.

Key Idea:

Rather than integrating forward dynamics during training (as in Neural ODEs), perform regression on the vector field of fixed conditional probability paths.

Training Procedure:

Given samples $\mathbf{x}_0 \sim p_0$ and $\mathbf{x}_1 \sim p_1$
Define interpolant: $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$
Train neural network $\mathbf{v}_\theta(\mathbf{x}_t, t)$ to match conditional vector field:

\[ \min_\theta \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_1}\left[\|\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0)\|^2\right] \]

This is simple L2 regression requiring no simulation.

Inference:

Integrate the learned field using standard ODE solvers:

\[ \frac{d\mathbf{x}}{dt} = \mathbf{v}_\theta(\mathbf{x}, t), \quad t \in [0, 1] \]

Optimal Transport Flow Matching:

Uses minibatch optimal transport to couple noise and data samples before interpolation, creating straighter paths that require fewer integration steps.

Results:

State-of-the-art on ImageNet
Better likelihood and sample quality than simulation-based methods
Extensions to Riemannian manifolds, discrete data, video generation

When to Use:

Training continuous flows efficiently
Want simulation-free gradients
Need strong likelihood estimates
See Recent Advances → Conditional Flow Matching for OT-CFM, Stochastic Interpolants, and the 2024–2026 production picture.

Rectified Flows: Learning Straight Trajectories¤

Rectified Flow (Liu, Gong & Liu, 2023 — ICLR) learns ODEs following straight-line paths connecting source and target distributions.

Training:

Given coupling between noise samples $\mathbf{u} \sim p_0$ and data samples $\mathbf{x} \sim p_1$:

Linearly interpolate: $\mathbf{x}_t = (1-t)\mathbf{u} + t\mathbf{x}$
Learn velocity field: $\frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t)$

Reflow Process (Key Innovation):

Train initial model
Generate paired samples: $(\mathbf{u}, \mathbf{x}_{\text{gen}})$ where $\mathbf{x}_{\text{gen}} = \text{model}(\mathbf{u})$
Retrain on these pairs

This iterative rectification progressively straightens trajectories.

Benefits:

Provably non-increasing convex transport costs
One-step or few-step generation: Straight paths require minimal integration
One reflow iteration typically suffices under realistic settings

Applications:

Stable Diffusion 3: Uses rectified flow formulation, outperforms pure diffusion
InstaFlow: Achieves 0.1-second generation demonstrating practical viability
One-step generation for real-time applications

When to Use:

Need few-step or one-step generation.
Real-time applications requiring fast inference.
Want to distill models for deployment.
The successor methods that achieve one-step generation without distillation (Mean Flows, TarFlow) are covered in Recent Advances → Mean Flows.

Discrete Flow Matching¤

Discrete Flow Matching (Gat et al., 2024 — NeurIPS Spotlight; Campbell et al., 2024 — NeurIPS) extends flows to discrete data (text, molecules, code) using Continuous-Time Markov Chains (CTMC).

Problem:

Traditional flows designed for continuous data. Dequantization workarounds prove inadequate for inherently discrete data like language.

Solution:

CTMC process with learnable time-dependent transition rates:

\[ \frac{dp_t(x)}{dt} = \sum_{x'} p_t(x') R_t(x' \to x) - p_t(x) \sum_{x'} R_t(x \to x') \]

Training:

Regress on conditional flow:

\[ \min_\theta \mathbb{E}_{t, x_0, x_1}\left[\|R_\theta(x_t, t) - R_{\text{true}}(x_t, t | x_0, x_1)\|^2\right] \]

Results:

FlowMol-CTMC: State-of-the-art molecular validity
Code generation: 1.7B parameter model achieves 13.4% Pass@10 on HumanEval
DNA sequence design: Dirichlet Flow Matching

When to Use:

Text generation (alternative to autoregressive)
Molecular generation with discrete atom types
Code generation
Any discrete structured data

Geometric Flows: Riemannian Manifolds¤

Riemannian Flow Matching (Chen & Lipman, 2024) extends flows to non-Euclidean geometries, critical for data on manifolds.

Applications:

Molecular conformations: SE(3) equivariant flows
Protein structures: SO(3) rotations and translations
Robotic configurations: Configuration space manifolds
Materials: FlowMM for crystal structure generation (3× efficiency improvement)

Key Idea:

Replace Euclidean straight-line interpolants with geodesics on the manifold:

\[ \frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t) \quad \text{on manifold } \mathcal{M} \]

When to Use:

Data naturally lives on manifolds
Symmetries and geometric constraints are important
Protein design, molecular generation, materials discovery
3D geometry and robotics applications

Recent Advances (2024–2026): Flow Matching, Mean Flows, and Modern Variants¤

If 2014–2020 was about making flows tractable (NICE → RealNVP → Glow → splines) and 2018–2022 about making them continuous (Neural ODEs, FFJORD), 2023–2026 is about flow matching as the unifying training objective for both flows and diffusion. The field has exploded — NeurIPS 2025 accepted 30+ flow-matching papers and ICLR 2026 received 150+ submissions (npj AI 2025 survey; Awesome Flow Matching list). This section organises the headline 2024–2026 contributions into six themes.

Conditional Flow Matching and Optimal-Transport Couplings¤

Conditional Flow Matching (CFM) / Independent CFM (I-CFM) (Tong et al., 2024 — TMLR) generalises Lipman et al.'s flow matching to arbitrary noise–data couplings, and introduces:

OT-CFM — uses minibatch optimal-transport couplings instead of independent pairs, dramatically straightening trajectories and reducing inference NFEs.
SB-CFM — a Schrödinger-bridge variant that combines OT with stochastic dynamics.

This is the practical recipe behind most modern flow-matching codebases, including the open-source TorchCFM library.

Stochastic Interpolants (Albergo & Vanden-Eijnden, 2023; Albergo et al., 2024) generalise flow matching with stochastic couplings and bridge ODEs and SDEs in a single framework — the framework that underlies SiT, Multitask Stochastic Interpolants (2026), and several recent unified objectives.

Generator Matching (Holderrieth et al., 2024) provides a single objective that subsumes flow matching, score matching, and discrete-flow training — the conceptual capstone of the 2023–2024 line of work.

Mean Flows: One-Step Generation Without Distillation¤

Mean Flows (Geng, Deng, Bai, Kolter & He, 2025 — NeurIPS Oral) replace flow matching's instantaneous velocity field with the average velocity along a trajectory:

\[ u^{\text{mean}}_\theta(x_t, s, t) \;\approx\; \frac{1}{t - s}\!\int_s^t v(x_\tau, \tau)\, d\tau. \]

Training uses a closed-form identity between average and instantaneous velocities, requiring no pre-training, no distillation, and no curriculum. The single-network model achieves:

FID 3.43 at 1 NFE on ImageNet 256² — trained from scratch (vs. flow-matching teachers needing 25–50 NFE for matching quality).

Follow-ups include AlphaFlow (2025–2026) — a careful study of why MeanFlow works and how to improve its variance — and Pixel Mean Flows (2026) — one-step latent-free image generation directly in pixel space. Mean flows are the cleanest demonstration to date that flows can match diffusion at near-zero inference cost without adversarial distillation.

Classifier-Free Guidance for Flow Matching¤

Standard CFG (Ho & Salimans, 2022) was designed for diffusion models; the 2025 wave adapts it specifically to flow matching:

CFG-Zero* (Fan et al., 2025) — observes that early-step velocity estimates undershoot in flow ODEs, and introduces (a) an optimised guidance scale correcting velocity inaccuracies, plus (b) zero-init, which zeroes the first few ODE-solver steps. Consistently outperforms standard CFG on Lumina-Next, SD3, FLUX.1, and Wan-2.1.
Flow-Matching Guidance Theory (2024–25) — first general framework deriving guidance techniques applicable to flow matching beyond CFG.

Discrete Flow Matching at Scale¤

Discrete flow matching has matured rapidly:

DFM (Gat et al., 2024 — NeurIPS Spotlight) and Multiflow (Campbell et al., 2024) — CTMC-based flows for categorical data; foundation for the modern discrete-flow literature.
Fisher Flow (Davis et al., 2024) — a geometric discrete flow that interprets categorical distributions as points on a statistical manifold with the Fisher–Rao metric.
DeFoG (Ho et al., ICML 2025) — DFM applied to graph generation; current SOTA on molecular graph synthesis.
Discrete Flow Matching as AR Post-Training (2025) — surprisingly effective post-training to fix compound-error accumulation in autoregressive language models.

These connect to the broader diffusion language model family (LLaDA, Mercury, Dream — see the Diffusion explainer). DFM and masked discrete diffusion are unified under the same CTMC formalism.

Variance-Reduced Training and Inference-Time Scaling¤

A 2025–2026 line of work attacks the variance of flow-matching gradient estimators and the inference compute budget:

Stable Velocity (2026) — a variance-perspective on flow matching that reveals a two-regime structure; introduces StableVM for unbiased variance-reduced training, VA-REPA for selective auxiliary supervision, and StableVS for finetuning-free acceleration.
Inference-Time Compute Scaling for Flow Matching (2025) — first method to demonstrate inference-time compute scaling for FM models, with applications to protein design and folding.
An Error Analysis of Flow Matching (NeurIPS 2024) — first complete theoretical analysis of how errors compound in flow-matching CNFs; informs the variance-reduction methods above.
Local Flow Matching (NeurIPS 2024) and Wasserstein Flow Matching (NeurIPS 2024) — generative modelling over families of distributions and over Wasserstein space respectively.

Flow Map Matching and Consistency Bridges¤

Flow Map Matching (Boffi et al., 2024) is a unifying mathematical framework that subsumes consistency models, mean flows, and rectified-flow distillation under a single objective: learn a flow map $\Phi(x_t, s, t)$ that pushes samples between any two times $s$ and $t$ along the ODE. Mean flows correspond to a particular parameterisation; consistency models are the special case $s = 0$.

Production-Scale Flow Matching¤

By 2026, rectified-flow / flow-matching transformers are a common recipe for new large-scale visual generators, especially when paired with latent codecs and DiT / MMDiT backbones:

Stable Diffusion 3 / 3.5 (Esser et al., 2024 — MMDiT) — rectified-flow MMDiT with Logit-Normal $t$ sampling.
FLUX.1 (Black Forest Labs, 2024) — 12B-parameter rectified-flow transformer.
HunyuanVideo, Wan 2.1 / 2.2, CogVideoX, Mochi-1 — all flow-matching DiTs over video VAE latents.
Flowception (2026) — temporally-expansive, variable-length non-autoregressive video flow matching.

Surveys and Resources¤

Resource	Year	Scope
Flow Matching Meets Biology and Life Science: A Survey (Nature npj AI)	2025	Comprehensive recent survey; biological applications + variant taxonomy
Flow matching for generative modelling in bioinformatics (Nature MI)	2026	Theoretical foundations + computational biology applications
Awesome Flow Matching	continuously updated	Curated list of flow-matching papers
TorchCFM	continuously updated	Reference implementation: vanilla / OT / Schrödinger CFM

Comparing Flows with Other Generative Models¤

Flows vs VAEs: Exact Likelihood vs Learned Compression¤

Aspect	Normalizing Flows	VAEs
Likelihood	Exact (classical flows) / tractable via probability-flow ODE (CNFs / flow matching)	Lower bound (ELBO)
Dimensionality	Input = Output (no compression)	Compressed latent ($\dim(z) \ll \dim(x)$)
Sample Quality	Historically blurry (Glow, RealNVP); modern flow-matching variants (TarFlow, Mean Flows) match diffusion	Blurry under L2; sharp once paired with adversarial / VQ losses
Training	Maximum likelihood (classical) or vector-field regression (flow matching)	ELBO (reconstruction + KL)
Mode Coverage	Excellent (exact density modelling)	Excellent; can suffer posterior collapse
Generation Speed	Single pass (classical) or 1–10 ODE steps (flow matching)	Single pass
Role in modern stacks	The generator in latent-space pipelines (SD3, FLUX.1)	The codec feeding flow / diffusion priors

When to Choose:

Use classical flows when exact likelihood is essential (anomaly detection, density estimation, model comparison) or lossless reconstruction matters; use flow matching / rectified flow when stable vector-field training and few-step generation matter more than exact likelihood
Use VAEs when compressed latent representations provide value for downstream tasks, interpretability matters, or computational constraints favor smaller latent spaces
Hybrid f-VAEs: Combine both—VAEs with flow-based posteriors or decoders

Flows vs GANs: Mode Coverage vs Sample Quality¤

Aspect	Normalizing Flows	GANs
Sample Quality	Pre-2024: noticeably below GANs; 2024+: TarFlow, Mean Flows, Rectified Flow match GAN quality	Strong; modern R3GAN / GigaGAN remain competitive after distillation
Training Stability	MLE / vector-field regression — stable	Adversarial; needs spectral norm + R1 + careful tuning
Likelihood	Exact (classical) / estimated by ODE divergence integral (FM / CNF)	None (implicit)
Mode Coverage	Excellent (density-based)	Prone to mode collapse
Standard evaluation	Negative log-likelihood, BPD	FID, Inception Score, KID
Consistency	Reconstructive (classical flows)	None — no encoder

When to Choose:

Use Flows for stable training, guaranteed mode coverage, likelihood evaluation, exact density scoring (anomaly detection), consistent reconstruction.
Use GANs when perceptual quality is paramount and one-step inference is non-negotiable; modern recipes (R3GAN, GigaGAN) close the historical-stability gap.
Hybrid: most modern fast-generators use adversarial losses on top of flow / diffusion teachers (ADD, DMD2, FLUX.1 [schnell]). The clean "flows vs GANs" dichotomy has dissolved in production.

Flows vs Diffusion: Speed vs Quality with Converging Trajectories¤

Historical Trade-off:

Diffusion: Superior sample quality, but 50-1000 iterative denoising steps
Flows: Fast single-step sampling, but lower perceptual quality

2023–2026 Developments Have Weakened the Trade-off:

Rectified flows (Liu et al., 2023) — straight paths enabling few-step generation; powers SD3 and FLUX.1.
Flow matching (Lipman et al., 2023) — simulation-free training, the unifying objective for both flows and diffusion.
TarFlow (Zhai et al., 2024) — transformer flows matching diffusion quality at one NFE.
Mean Flows (Geng et al., NeurIPS 2025 Oral) — FID 3.43 at 1 NFE on ImageNet 256² with no distillation, no curriculum.
CFG-Zero* (Fan et al., 2025) — flow-specific guidance closing the conditional-generation gap with diffusion.

Aspect	Classical Flows (NICE/RealNVP/Glow)	Modern Flow Matching	Diffusion Models
Sampling Speed	Single pass	1–10 ODE steps (Mean Flows / TarFlow at 1; Rectified Flow at 1–4)	20–1000 steps un-distilled; 1–4 distilled
Sample Quality	Below GANs and diffusion	Competitive with diffusion on several benchmarks (TarFlow 2024, Mean Flows 2025)	Leading on many visual benchmarks
Likelihood	Exact via change of variables	Estimated by ODE divergence integral	ELBO / probability-flow ODE estimates
Training Stability	Stable (MLE)	Stable (vector-field regression)	Stable (denoising)
Jacobian computation	Required (structured $O(D)$)	Not required (regression objective)	Not required
Architectural constraints	Invertibility + equal dimensions	None — any DiT / U-Net	None

When to Choose:

Use classical flows when you need exact likelihood (anomaly detection, density estimation, model comparison) or guaranteed lossless reconstruction.
Use flow matching for fast image / video generation at production scale — the rectified-flow MMDiT recipe (SD3, FLUX.1) and Mean Flows are the 2026 default.
Use diffusion when you need the most mature post-training tooling (DPO, RLHF, ControlNet, IP-Adapter) — though most of these are now being ported to flow matching.
Hybrid is the norm: Flow matching is a CNF objective and also a diffusion training objective — the unifying Generator Matching framework subsumes both.

Practical Implementation Guidance¤

Framework and Package Selection¤

Modern Flow Implementations:

:simple-jax:{ .lg .middle } JAX Ecosystem
Distrax: Bijectors and distributions that compose with JAX transforms
FlowJAX: JAX-native flow utilities and common transformations
Optax: Optimizer chains for clipping, schedules, and adaptive training
Composable Bijectors
Build RealNVP, Glow, and spline flows from invertible transformations
Keep log-determinant calculations explicit and testable
Use JAX transformations for batching, differentiation, and compilation
:simple-jax:{ .lg .middle } JAX Implementations
Distrax: High-performance flows with JAX transformations
Artifex: This repository—complete flows with Flax/NNX
Optimal for scientific computing and research

Framework Choice:

PyTorch: Dominates academic research, excellent debugging
TensorFlow: Production stability, enterprise deployment
JAX: High-performance scientific computing, automatic differentiation

Architecture Selection Guide¤

graph TD
    Start{{"What's your<br/>primary goal?"}}
    Start -->|"Exact density<br/>(anomaly detection,<br/>scientific NLL)"| Dense{{"Dimensionality?"}}
    Start -->|"High-quality<br/>generation"| Gen{{"Modality?"}}
    Start -->|"One-step<br/>real-time"| Fast["MeanFlow / TarFlow<br/>(Sec. Recent Advances)"]

    Dense -->|"Low-Med<br/>(< 100)"| MAF["MAF<br/>(Masked Autoregressive)"]
    Dense -->|"High<br/>(> 100)"| Spline["Neural Spline Flows"]
    Dense -->|"Very High<br/>(> 1000)"| Residual["Residual Flows or<br/>Continuous Flows"]

    Gen -->|"Image / video<br/>at scale"| FM["Flow Matching DiT<br/>(SD3 / FLUX-style)"]
    Gen -->|"Discrete<br/>(text, molecules,<br/>graphs)"| DFM["Discrete Flow Matching<br/>(DFM, Multiflow, DeFoG)"]
    Gen -->|"Manifold<br/>(proteins, materials)"| RFM["Riemannian<br/>Flow Matching"]
    Gen -->|"Small-scale<br/>baseline"| Glow["Glow / RealNVP"]

    style Start fill:#e1f5ff
    style Dense fill:#fff3cd
    style Gen fill:#fff3cd
    style MAF fill:#d4edda
    style Spline fill:#d4edda
    style Glow fill:#d4edda
    style FM fill:#d4edda
    style DFM fill:#d4edda
    style RFM fill:#d4edda
    style Residual fill:#d4edda
    style Fast fill:#d4edda

Recommended Hyperparameters by Task¤

Image Flows (Artifex Glow / RealNVP):

glow_config = {
    "image_shape": (32, 32, 3),
    "blocks_per_scale": 6-8,
    "coupling_hidden_dims": [512, 512],
    "batch_size": 32-64,
    "learning_rate": 1e-3,
    "lr_decay": "polynomial",  # decay to 1e-4
    "preprocessing": ["dequantize", "logit_transform"],
}

realnvp_config = {
    "input_dim": 32 * 32 * 3,
    "num_coupling_layers": 8-12,
    "coupling_hidden_dims": [512, 512],
    "batch_size": 64-128,
    "learning_rate": 1e-3,
    "lr_decay": "polynomial",  # decay to 1e-4
    "preprocessing": ["dequantize", "logit_transform"],
}

Density Estimation on Tabular Data (MAF/Neural Spline Flows):

config = {
    "num_transforms": 5-10,
    "hidden_dims": [512, 512],  # Match or exceed data dimensionality
    "num_bins": 8-16,  # For spline flows
    "batch_size": 256,
    "learning_rate": 5e-4,
    "preprocessing": ["standardization"],
}

Variational Inference (IAF/RealNVP):

config = {
    "num_steps": 4-8,
    "hidden_dims": [256, 256],
    "base_distribution": "gaussian",  # Learned mean/std
    "learning_rate": 1e-3,
    "annealing_schedule": "linear",  # For KL term
}

Flow Matching at Scale (SD3 / FLUX-style):

flow_matching_config = {
    "backbone": "DiT",                       # or MMDiT for text-to-image
    "hidden_dim": 1024 - 4096,
    "num_layers": 28 - 38,
    "patch_size": 2,                         # for latent diffusion
    "interpolant": "linear",                 # x_t = (1-t) x_0 + t x_1
    "coupling": "ot_minibatch",              # OT-CFM (Tong et al. 2024)
    "t_sampling": "logit_normal",            # Esser et al. 2024 (SD3)
    "guidance": "cfg_zero_star",             # Fan et al. 2025
    "ode_solver": "euler",                   # 25-50 NFEs un-distilled
    "optimizer": "adamw",
    "learning_rate": 1e-4,
    "weight_decay": 1e-2,
    "ema_decay": 0.9999,
}

For one-step generation without distillation, use the MeanFlow recipe (Geng et al., 2025) which adds a second time argument $s$ and a closed-form average-velocity identity to the same DiT backbone.

Common Implementation Pitfalls¤

Forgetting Preprocessing

# WRONG: Direct training on uint8 images
flow.train(images_uint8)

# CORRECT: Dequantize + logit transform
images = dequantize(images_uint8)
images = logit_transform(images, alpha=0.05)
flow.train(images)

Computing Determinants Directly

# WRONG: Direct determinant (overflow!)
det = jnp.linalg.det(jacobian)
log_det = jnp.log(det)

# CORRECT: Log-space computation
sign, log_det = jnp.linalg.slogdet(jacobian)
# Or for triangular matrices:
log_det = jnp.sum(jnp.log(jnp.abs(jnp.diagonal(jacobian))))

Ignoring Invertibility Checks

# Monitor reconstruction error
x_reconstructed = flow.inverse(flow.forward(x))
error = jnp.mean(jnp.abs(x - x_reconstructed))

# Should be < 1e-5 for numerical stability
if error > 1e-3:
    warnings.warn(f"Poor invertibility: {error}")

Wrong Coupling Architecture

# WRONG: Too deep coupling networks
coupling = MLP([512, 512, 512, 512, 512])  # Overkill!

# CORRECT: 2-3 layers sufficient
coupling = MLP([512, 512, output_dim])

Evaluation Metrics¤

Density-Estimation Metrics (Classical Flows)¤

The headline metric for any flow that gives an exact log-likelihood is bits per dimension ($\mathrm{BPD} = -\log_2 p_\theta(x) / D$). Standard reporting:

Dataset	Best classical-flow BPD (lower = better)	Reference
MNIST	~1.05	NSF, Glow
CIFAR-10 (uint8 + dequant)	~3.35	NSF
ImageNet 32 / 64	~3.81 / 3.66	Glow / Flow++
FFHQ-1024 (rescaled)	~3.0 (per pixel before dequant)	HDiT-flow hybrids

Always include the dequantisation scheme when reporting BPD — uniform vs variational dequantisation typically differ by 0.05–0.10 BPD.

Sample Quality (Flow Matching / TarFlow / Mean Flows)¤

Modern flow-matching DiTs are evaluated on the same metrics as diffusion models:

FID (Heusel et al., 2017) — Fréchet Inception Distance.
CMMD (Jayasumana et al., 2024) — CLIP-feature MMD; better correlated with human perception.
CLIPScore — alignment for text-conditional generation.
Inception Score (IS) — legacy class-conditional metric.

Reference numbers from the Recent Advances section: TarFlow matches diffusion FID at one NFE, Mean Flows reach FID 3.43 at 1 NFE on ImageNet 256², EqM (cross-listed in EBM literature) reaches FID 1.90.

Test Quantities Specific to Flows¤

Reconstruction error — $\|x - f^{-1}(f(x))\|$ should be ≤ 1e-5 in float32; a reliable indicator of numerical stability.
Per-layer log-determinants — sudden spikes signal training instability long before NaN.
Path-length metrics (rectified flow) — average $\|v_\theta(x_t, t)\|$ along the trajectory; a well-rectified model has nearly-constant velocity.

Production Considerations¤

Inference Cost¤

Method	NFEs (typical)	1024² latency on H100
Classical flow (Glow / NSF)	1 forward pass	~50–200 ms
Flow matching (50 NFE Euler)	50	~2 s
Flow matching (10 NFE DPM-Solver-3)	10	~400 ms
Rectified flow (1–4 NFE after reflow)	1–4	~50–150 ms
Mean Flow (1 NFE, NeurIPS 2025)	1	~40 ms
FLUX.1 [schnell] (latent adversarial distillation)	1–4	~80 ms

The single biggest production lever is distillation: Mean Flow / TarFlow at 1 NFE, or rectified-flow + reflow + LADD for 1–4 NFE, recover near-teacher quality at near-GAN inference cost.

Memory and Quantisation¤

Classical flows need to store the exact-precision Jacobian computations during forward and backward — INT8 PTQ is rarely safe; FP16 / BF16 work but require occasional FP32 fallbacks for slogdet.
Flow-matching DiTs follow the diffusion quantisation playbook (FP16 / BF16 universally safe, INT8 with per-channel calibration).

Common Production Issues¤

Drift on out-of-distribution inputs — classical flows assign well-defined likelihoods to OOD inputs but those likelihoods can be higher than in-distribution ones (Nalisnick et al., 2019) — a counter-intuitive failure mode that complicates OOD detection from $\log p_\theta$ alone.
Numerical reversibility — long flow stacks accumulate float-precision error; for diffusion-style flow matching the reverse trajectory should match within ~1e-3 in pixel space at FP32.
Adaptive ODE solvers can blow up the NFE count on stiff trajectories; production samplers usually pin a fixed-step solver (Euler, DPM-Solver-3) for predictable latency.

For the broader unified picture and how flows fit alongside diffusion / VAE / GAN / EBM / AR systems in 2026 stacks, see Generative Models — A Unified View.

Summary and Key Takeaways¤

Normalizing flows provide a unique combination of exact likelihood computation, fast sampling, and stable training through invertible transformations with tractable Jacobians.

Core Principles¤

Exact Likelihood

Flows compute exact probability through change of variables formula, enabling precise density estimation
Invertible Architecture

Bijective transformations allow both efficient sampling and exact inference
Tractable Jacobians

Structured Jacobians (triangular, diagonal) reduce complexity from $O(D^3)$ to $O(D)$
Stable Training

Maximum likelihood provides clear, monotonic objective without adversarial dynamics
Composable Design

Stack simple transformations to build arbitrarily complex distributions

Architecture Selection Matrix¤

Architecture	Density Estimation	Fast Sampling	Best For
NICE / RealNVP	⭐⭐⭐	⭐⭐⭐⭐⭐	Foundational baselines; small-scale images
Glow	⭐⭐⭐⭐	⭐⭐⭐⭐	Fixed-size image flows with learned channel mixing
MAF	⭐⭐⭐⭐⭐	⭐⭐	Density on tabular data
IAF	⭐⭐	⭐⭐⭐⭐⭐	Fast sampling; VAE posteriors / VI
Neural Spline (NSF)	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Maximum expressiveness with tractable Jacobians
Residual Flows / FFJORD	⭐⭐⭐⭐	⭐⭐	Free-form Jacobians; high-dimensional data
Flow Matching DiT	⭐⭐⭐⭐ (ODE likelihood estimate)	⭐⭐⭐⭐ (10–30 NFEs)	Production T2I / T2V at scale (SD3, FLUX.1)
Rectified Flow	⭐⭐⭐⭐	⭐⭐⭐⭐⭐ (1–4 NFEs after reflow)	Few-step generation
TarFlow	⭐⭐⭐⭐	⭐⭐⭐⭐⭐ (1 NFE)	One-step generation matching diffusion
Mean Flows	⭐⭐⭐ (ODE likelihood estimate)	⭐⭐⭐⭐⭐ (1 NFE)	One-step from scratch, no distillation
Discrete Flow Matching	⭐⭐⭐⭐	⭐⭐⭐⭐	Text, code, molecular graphs
Riemannian Flow Matching	⭐⭐⭐⭐	⭐⭐⭐⭐	Proteins, materials, manifold data

Recent Advances (2023–2026)¤

Flow Matching (Lipman et al., 2023) — simulation-free CNF training; the unifying objective.
Rectified Flow (Liu et al., 2023) — straight-line ODEs; powers Stable Diffusion 3 and FLUX.1.
Conditional / OT-CFM (Tong et al., 2024) — minibatch-OT couplings; cleanest practical recipe.
TarFlow (Zhai et al., 2024) — transformer flow matching diffusion quality at one NFE.
Mean Flows (Geng et al., NeurIPS 2025 Oral) — average-velocity training; FID 3.43 at 1 NFE on ImageNet 256² with no distillation.
CFG-Zero* (Fan et al., 2025) — improved CFG specifically for flow matching.
Discrete Flow Matching (Gat et al., 2024; Campbell et al., 2024) — CTMC flows for text, molecules, graphs.
Riemannian / Geometric Flow Matching (Chen & Lipman, 2024) — flows on manifolds; the basis for protein and materials applications.
Flow Map Matching (Boffi et al., 2024) — unifying framework for consistency models, mean flows, and rectified-flow distillation.
Stable Velocity / Variance-reduced FM (2026) — unbiased variance-reduced training and finetuning-free acceleration.

When to Use Normalizing Flows¤

Best Use Cases:

Exact likelihood is essential (anomaly detection, model comparison)
Fast generation required (real-time audio, interactive systems)
Stable training preferred over adversarial methods
Lossless reconstruction needed
Mode coverage guarantees important

Avoid When:

Maximum perceptual quality is sole objective (use GANs/diffusion)
Compressed representations needed (use VAEs)
Architectural flexibility critical (diffusion has fewer constraints)
Very high dimensions with limited resources (consider latent diffusion)

Future Directions¤

One-step generation via rectified flows and distillation
Pyramidal structures for video and high-resolution media
Hybrid models combining flows with diffusion, transformers
Scientific applications in materials, proteins, molecular generation
Geometric awareness for data on manifolds

Next Steps¤

Flow User Guide

Practical usage guide with implementation examples and training workflows
Flow API Reference

Complete API documentation for RealNVP, Glow, MAF, IAF, and Neural Spline Flows
MNIST Tutorial

Step-by-step hands-on tutorial: train a flow model on MNIST from scratch
Planned Example Topics

Track still-unshipped advanced flow and diffusion example work

References and Further Reading¤

Seminal Papers (Must Read)¤

Dinh, L., Krueger, D., & Bengio, Y. (2014). "NICE: Non-linear Independent Components Estimation"
arXiv:1410.8516
First practical coupling layer architecture

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). "Density estimation using Real NVP"
arXiv:1605.08803
Affine coupling layers and multi-scale architecture

Kingma, D. P., & Dhariwal, P. (2018). "Glow: Generative Flow with Invertible 1×1 Convolutions"
arXiv:1807.03039
State-of-the-art image generation with learnable permutations

Papamakarios, G., Pavlakou, T., & Murray, I. (2017). "Masked Autoregressive Flow for Density Estimation"
arXiv:1705.07057
Autoregressive flows for maximum expressiveness

Durkan, C., Bekasov, A., Murray, I., & Papamakarios, G. (2019). "Neural Spline Flows"
arXiv:1906.04032
Monotonic rational-quadratic splines for flexible transformations

Continuous and Modern Flows¤

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). "Neural Ordinary Differential Equations"
arXiv:1806.07366
Continuous-time flows using ODE solvers

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., & Duvenaud, D. (2019). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models"
arXiv:1810.01367
Tractable continuous flows with Hutchinson's estimator

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). "Flow Matching for Generative Modeling"
arXiv:2210.02747
Simulation-free training paradigm

Liu, X., Gong, C., & Liu, Q. (2022). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
arXiv:2209.03003
Straight paths for one-step generation

Conditional Flow Matching and Couplings (2023–2024)¤

Tong, A., et al. (2024). "Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport" (TMLR)
arXiv:2302.00482
Conditional / OT-CFM / SB-CFM — the practical flow-matching recipe and TorchCFM library.

Albergo, M. S., & Vanden-Eijnden, E. (2023). "Building Normalizing Flows with Stochastic Interpolants"
arXiv:2303.08797
Stochastic interpolants — generalises FM with stochastic couplings; bridges ODEs and SDEs.

Albergo, M. S., et al. (2024). "Stochastic Interpolants: A Unifying Framework for Flows and Diffusions"
arXiv:2310.03725
Single objective subsuming flow matching + score matching.

Holderrieth, P., et al. (2024). "Generator Matching"
arXiv:2410.20587
Unifying objective subsuming flow / score / discrete-flow training.

One-Step Generation: Mean Flows and TarFlow (2024–2026)¤

Geng, Z., Deng, M., Bai, X., Kolter, J. Z., & He, K. (2025). "Mean Flows for One-step Generative Modeling" (NeurIPS Oral)
arXiv:2505.13447
Average-velocity training; FID 3.43 at 1 NFE on ImageNet 256² with no distillation.

(2025–2026). "AlphaFlow: Understanding and Improving MeanFlow Models"
arXiv:2510.20771
Variance analysis and improvements for MeanFlow; AlphaFlow extension.

(2026). "One-step Latent-free Image Generation with Pixel Mean Flows"
arXiv:2601.22158
Pixel-space one-step image generation via Mean Flows.

Zhai, S., et al. (2024). "Normalizing Flows are Capable Generative Models" (TarFlow)
arXiv:2412.06329
Transformer-based flow matching diffusion quality at one NFE.

Boffi, N. M., et al. (2024). "Flow Map Matching with Stochastic Interpolants: A Mathematical Framework for Consistency Models"
arXiv:2406.07507
Unifies consistency models, mean flows, and rectified-flow distillation under a single flow-map objective.

Production Flow Matching and Guidance (2024–2026)¤

Esser, P., et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Stable Diffusion 3, MMDiT)
arXiv:2403.03206
Rectified-flow MMDiT with Logit-Normal $t$ sampling; production-scale.

Black Forest Labs (2024). "FLUX.1: Open-weights Rectified-flow Transformer"
github.com/black-forest-labs/flux
12B-parameter open-weights flow transformer; FLUX.1 [schnell] is 1–4-step distilled.

Fan, W., et al. (2025). "CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models"
arXiv:2503.18886
Optimised guidance scale + zero-init for early ODE steps; consistently beats CFG on SD3, Flux, Wan-2.1.

(2025). "Inference-Time Compute Scaling for Flow Matching"
arXiv:2510.17786
First inference-time compute scaling for FM; generalises to protein design and folding.

(2026). "Stable Velocity: A Variance Perspective on Flow Matching"
arXiv:2602.05435
StableVM (variance-reduced training), VA-REPA, StableVS finetuning-free acceleration.

(2026). "Multitask Learning with Stochastic Interpolants"
arXiv:2508.04605
Operator-valued interpolants; one model performs multiple generative tasks.

(2026). "Flowception: Temporally Expansive Flow Matching for Video Generation"
arXiv:2512.11438
Variable-length non-autoregressive video flow matching with stochastic frame insertion.

Theory and Analysis¤

(NeurIPS 2024). "An Error Analysis of Flow Matching for Deep Generative Modeling"
openreview vES22INUKm
First complete theoretical error analysis of flow-matching CNFs.

(2024–25). "On the Guidance of Flow Matching"
openreview pKaNgFzJBy
First general framework for guidance techniques applicable to flow matching beyond CFG.

(NeurIPS 2024). "Local Flow Matching Generative Models"
openreview MM197t8WlM
Local FM construction for tighter generation guarantees.

(NeurIPS 2024). "Wasserstein Flow Matching: Generative Modeling over Families of Distributions"
openreview HB4lr0ykTi
Flow matching in Wasserstein space.

Discrete Flow Matching (2024–2026)¤

Gat, I., et al. (2024). "Discrete Flow Matching" (NeurIPS Spotlight)
arXiv:2407.15595
CTMC-based flows for discrete data (text, code, molecules).

Campbell, A., et al. (2024). "Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design" (Multiflow, NeurIPS)
arXiv:2402.04997
Multimodal discrete flows; protein sequence + structure co-design.

Davis, O., et al. (2024). "Fisher Flow Matching for Generative Modeling over Discrete Data"
arXiv:2405.14664
Geometric (Fisher–Rao) discrete flow matching.

Ho, Q., et al. (2025). "DeFoG: Discrete Flow Matching for Graph Generation" (ICML)
icml.cc/virtual/2025/poster/45644
SOTA molecular graph synthesis via DFM.

Riemannian / Geometric Flows¤

Chen, R. T. Q., & Lipman, Y. (2024). "Riemannian Flow Matching on General Geometries"
arXiv:2302.03660
Flows on manifolds for geometric data.

Survey References¤

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). "Normalizing Flows for Probabilistic Modeling and Inference"
arXiv:1912.02762 | JMLR 22(57):1-64, 2021
Complete tutorial covering theory and methods.

Kobyzev, I., Prince, S. J., & Brubaker, M. A. (2020). "Normalizing Flows: An Introduction and Review of Current Methods" (IEEE TPAMI)
arXiv:1908.09257
Pre-flow-matching introduction with taxonomy.

(Nature npj AI, 2025). "Flow Matching Meets Biology and Life Science: A Survey"
arXiv:2507.17731 | Nature
Recent comprehensive survey: variant taxonomy + biological applications.

(Nature MI, 2026). "Flow Matching for Generative Modelling in Bioinformatics and Computational Biology"
Nature MI
Theoretical foundations + computational biology applications.

Awesome Flow Matching (Stochastic Interpolant)
github.com/dongzhuoyao/awesome-flow-matching
Continuously-updated curated list — the field-tracking resource.

TorchCFM — Conditional Flow Matching Library
github.com/atong01/conditional-flow-matching
Reference implementation of vanilla / OT / Schrödinger CFM.

Online Resources¤

Lilian Weng's Blog: "Flow-based Deep Generative Models"
lilianweng.github.io/posts/2018-10-13-flow-models
Complete blog post with clear explanations and visualizations

Eric Jang's Tutorial
blog.evjang.com/2018/01/nf1.html
Two-part tutorial with code

UvA Deep Learning Tutorial 11
uvadlc-notebooks.readthedocs.io
Complete Colab notebooks

awesome-normalizing-flows
github.com/janosh/awesome-normalizing-flows
Curated list with 700+ papers

Direction	Complexity	Use Case
Forward (density)	\(O(1)\) passes	All dimensions computed in parallel
Inverse (sampling)	\(O(D)\) passes	Sequential computation required