Variational Autoencoders (VAEs) Explained¤
- Probabilistic Framework: Learn distributions over latent codes rather than deterministic encodings
- Structured Latent Space: Continuous, smooth latent space enabling interpolation and controlled generation
- Principled Generation: Sample from learned prior distribution to generate new, realistic data
- Differentiable Training: End-to-end optimization using the reparameterization trick
New here?
For a one-page map of how VAEs fit next to GANs, Diffusion, Flows, EBMs, and Autoregressive models, start with Generative Models — A Unified View. This page is the deep-dive on VAEs specifically.
Overview¤
Variational Autoencoders (VAEs), introduced concurrently by Kingma & Welling (2014) and Rezende, Mohamed & Wierstra (2014), are a class of deep generative models that combine neural networks with variational inference to learn probabilistic representations of data. Unlike standard autoencoders that learn deterministic mappings, VAEs learn probability distributions over latent representations, enabling principled data generation and interpretable latent spaces. For a modern, authoritative tutorial see Kingma & Welling (2019).
What makes VAEs special?
VAEs solve a fundamental challenge in generative modeling: how to learn a structured, continuous latent space that can be sampled to generate new, realistic data. By imposing a probabilistic structure through variational inference, VAEs create smooth latent spaces where:
- Interpolation works naturally - moving between two points in latent space produces meaningful intermediate outputs
- Random sampling generates valid data - sampling from the prior produces realistic new samples
- The representation is interpretable - the latent space has structure that can be understood and controlled
The Intuition: Compression and Blueprints¤
Think of VAEs like an architect creating blueprints:
- The Encoder compresses a complex house (your data) into essential instructions (latent vector), capturing key features (number of rooms, architectural style, materials) while discarding minor details like exact nail positions.
- The Latent Space is a structured blueprint repository where similar designs cluster together: ranch houses near each other, Victorian mansions in another region, modern apartments elsewhere.
- The Decoder rebuilds houses from blueprints, reconstructing recognizable structures though minor details differ from the original.
The critical distinction: VAEs encode to probability distributions, not single points. Each house maps to a probability cloud of similar blueprints, ensuring the latent space remains smooth and continuous. This enables generation—sample a random blueprint from the structured space, and the decoder builds a valid house, even one never seen before.
Mathematical Foundation¤
The Generative Story¤
VAEs model the data generation process as a two-step procedure:
- Sample latent code: \(z \sim p(z)\) from a simple prior distribution (typically standard normal)
- Generate data: \(x \sim p_\theta(x|z)\) using a decoder network parameterized by \(\theta\)
The goal is to learn parameters \(\theta\) (decoder) and \(\phi\) (encoder) that maximize the likelihood of observed data \(p_\theta(x)\).
graph LR
subgraph "True Generative Model"
A["Prior p(z)<br/>𝒩(0,I)"] --> B["Decoder p(x|z)"]
B --> C["Data x"]
end
subgraph "Inference Model"
C2["Data x"] --> D["Encoder q(z|x)"]
D --> E["Approximate<br/>Posterior"]
end
style A fill:#e1f5ff
style B fill:#fff3e0
style D fill:#f3e5f5
Variational Inference: Why We Need Approximation¤
Variational inference (Jordan et al., 1999; see Blei et al., 2017 for a modern review) replaces an intractable inference problem with a tractable optimisation problem.
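The true posterior \(p_\theta(z|x)\) tells us what latent code likely generated our data. However, computing it requires Bayes' rule:
\[p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)} = \frac{p_\theta(x|z)\,p(z)}{\int p_\theta(x|z)\,p(z)\,dz}\]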
The integral in the denominator (the evidence \(p_\theta(x)\)) is intractable for high-dimensional \(z\)—we'd need to integrate over all possible latent codes. VAEs sidestep this by learning an approximate posterior \(q_\phi(z|x)\) (the encoder) that's easy to compute.
The ELBO: Evidence Lower BOund¤
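The key insight of VAEs is to maximize a tractable lower bound on the log-likelihood called the Evidence Lower BOund (ELBO):
\[\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))\]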
This inequality states that the log-likelihood is always at least as large as the ELBO. The gap between them equals exactly \(D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z|x))\)—when our approximate posterior perfectly matches the true posterior, there's no gap and we achieve the true likelihood.
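As a sanity check on the gap identity, the sketch below evaluates both sides in a toy model where every quantity is available in closed form. The model (prior \(\mathcal{N}(0,1)\), likelihood \(\mathcal{N}(z,1)\)), the values of `x`, `m`, `s2`, and all function names are illustrative assumptions, not part of any library.

```python
import jax.numpy as jnp

# Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(z, 1).
# Then p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
LOG_2PI = jnp.log(2.0 * jnp.pi)

def log_evidence(x):
    # log N(x; 0, 2)
    return -0.5 * (LOG_2PI + jnp.log(2.0)) - x**2 / 4.0

def elbo(x, m, s2):
    # q(z|x) = N(m, s2); both ELBO terms are closed-form in this model.
    recon = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s2)   # E_q[log p(x|z)]
    kl_prior = 0.5 * (m**2 + s2 - 1.0 - jnp.log(s2))     # KL(q || p(z))
    return recon - kl_prior

def kl_to_posterior(x, m, s2):
    # KL(N(m, s2) || N(x/2, 1/2))
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (s2 / post_var + (m - post_mean) ** 2 / post_var
                  - 1.0 + jnp.log(post_var / s2))

x, m, s2 = 1.0, 0.3, 0.8
gap = log_evidence(x) - elbo(x, m, s2)
print(float(gap), float(kl_to_posterior(x, m, s2)))  # the two numbers agree
```

Setting `m = x / 2` and `s2 = 0.5` makes the approximate posterior equal to the true one, and the gap drops to zero.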
Derivation from First Principles¤
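Starting with the log-likelihood and introducing our approximate posterior:
\[\log p_\theta(x) = \log \int p_\theta(x, z)\,dz = \log \int q_\phi(z|x)\,\frac{p_\theta(x, z)}{q_\phi(z|x)}\,dz = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]\]
Applying Jensen's inequality (since log is concave):
\[\log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]\]
We get:
\[\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]\]
Expanding \(p_\theta(x, z) = p_\theta(x|z)p(z)\):
\[\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))\]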
Two Interpretable Terms¤
The ELBO naturally decomposes into two competing objectives:
- Reconstruction Term: \(\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\)
    - Measures how well we can reconstruct the input from sampled latent codes
    - Encourages the model to preserve information
    - Higher is better (less negative)
- KL Divergence: \(D_{\text{KL}}(q_\phi(z|x) \| p(z))\)
    - Measures how close our learned encoding is to the prior
    - Regularizes the latent space to be smooth and structured
    - Prevents "cheating" by spreading encodings arbitrarily far apart
    - Lower is better (closer to prior)
The fundamental trade-off: The reconstruction term wants to encode all information to perfectly reconstruct. The KL term wants to compress encodings to match the simple prior. Training finds the optimal balance, creating a structured latent space that retains essential information while remaining smooth for generation.
Architecture Components¤
Encoder: Variational Posterior \(q_\phi(z|x)\)¤
The encoder is a neural network that maps inputs to parameters of a probability distribution over latent codes:
\[q_\phi(z|x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma^2_\phi(x))\big)\]
For a diagonal Gaussian (most common choice), the encoder outputs:
- Mean \(\mu_\phi(x) \in \mathbb{R}^d\) - the center of the latent distribution
- Log-variance \(\log \sigma^2_\phi(x) \in \mathbb{R}^d\) - the spread/uncertainty
graph LR
A["Input x<br/>(e.g., 28×28 image)"] --> B["Encoder Network<br/>(Conv layers or FC)"]
B --> C["Mean μ<br/>(d dimensions)"]
B --> D["Log-variance log σ²<br/>(d dimensions)"]
C --> E["Latent Distribution<br/>𝒩(μ, σ²I)"]
D --> E
style B fill:#f3e5f5
style E fill:#e8eaf6
Why output log-variance? Numerical stability. Variance must be positive, and learning \(\log \sigma^2\) allows the network to output any real number while ensuring \(\sigma^2 = \exp(\log \sigma^2) > 0\).
Why diagonal covariance? Full covariance matrices require \(O(d^2)\) parameters and are harder to optimize. Diagonal covariance assumes independence between dimensions, requiring only \(O(d)\) parameters while working well in practice.
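A minimal sketch of such an encoder head in plain JAX follows. The one-hidden-layer MLP, the layer sizes, the parameter names (`w_h`, `w_mu`, `w_lv`, ...) and the 0.01 initialisation scale are all assumptions chosen for illustration; a real model would typically use convolutional layers and a proper initialiser.

```python
import jax
import jax.numpy as jnp

def init_encoder(key, in_dim, hidden_dim, latent_dim):
    # Small random weights for a one-hidden-layer MLP (illustrative only).
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 0.01
    return {
        "w_h": scale * jax.random.normal(k1, (in_dim, hidden_dim)),
        "b_h": jnp.zeros(hidden_dim),
        "w_mu": scale * jax.random.normal(k2, (hidden_dim, latent_dim)),
        "b_mu": jnp.zeros(latent_dim),
        "w_lv": scale * jax.random.normal(k3, (hidden_dim, latent_dim)),
        "b_lv": jnp.zeros(latent_dim),
    }

def encode(params, x):
    # Shared trunk, then two linear heads: one for the mean, one for log-variance.
    h = jax.nn.relu(x @ params["w_h"] + params["b_h"])
    mu = h @ params["w_mu"] + params["b_mu"]
    log_var = h @ params["w_lv"] + params["b_lv"]   # unconstrained real numbers
    return mu, log_var

params = init_encoder(jax.random.PRNGKey(0), in_dim=784, hidden_dim=128, latent_dim=16)
x = jnp.ones((4, 784))                  # a dummy batch of 4 flattened 28x28 images
mu, log_var = encode(params, x)
sigma2 = jnp.exp(log_var)               # positive for any real-valued head output
```

The two heads each emit \(d\) numbers per input, which is the \(O(d)\) cost of the diagonal parameterisation.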
Decoder: Likelihood \(p_\theta(x|z)\)¤
The decoder is a neural network that maps latent codes back to data space:
The choice of output distribution depends on your data:
- Gaussian (continuous): For real-valued images (often simplified to MSE loss with fixed variance)
- Bernoulli (binary): For binary images or features (use sigmoid + BCE loss)
- Categorical: For discrete data (use softmax + cross-entropy)
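The three likelihood choices translate into three reconstruction losses. The sketch below is one possible way to write them in JAX; the function names are hypothetical, and each loss sums over the non-batch axes so it can be paired with a per-sample KL term.

```python
import jax
import jax.numpy as jnp

def gaussian_recon_loss(x, x_mean):
    # Fixed-variance Gaussian: up to scale and an additive constant, the
    # negative log-likelihood is the sum of squared errors per sample.
    return jnp.sum((x - x_mean) ** 2, axis=tuple(range(1, x.ndim)))

def bernoulli_recon_loss(x, logits):
    # Binary cross-entropy written on logits for numerical stability:
    # -[x log p + (1 - x) log(1 - p)] with p = sigmoid(logits)
    #   = softplus(logits) - x * logits
    per_pixel = jax.nn.softplus(logits) - x * logits
    return jnp.sum(per_pixel, axis=tuple(range(1, x.ndim)))

def categorical_recon_loss(labels, logits):
    # Cross-entropy for integer class labels; logits has a trailing class axis.
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    picked = jnp.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    return -jnp.sum(picked, axis=tuple(range(1, labels.ndim)))

x = jnp.array([[0.0, 1.0, 1.0]])
logits = jnp.array([[-2.0, 0.5, 3.0]])
print(bernoulli_recon_loss(x, logits))
```

Working on logits rather than on sigmoid outputs avoids taking the log of a probability that has rounded to exactly 0 or 1.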
graph LR
A["Latent z<br/>(d dimensions)"] --> B["Decoder Network<br/>(Transposed Conv or FC)"]
B --> C["Reconstruction x̂ = μ_θ(z)<br/>(same shape as input)"]
style B fill:#fff3e0
style C fill:#e8f5e9
The Reparameterization Trick¤
The Problem: Backpropagation Through Sampling¤
We need to compute gradients of \(\mathbb{E}_{q_\phi(z|x)}[f(z)]\) with respect to \(\phi\). Naively sampling \(z \sim q_\phi(z|x)\) and computing \(\nabla_\phi f(z)\) doesn't work because the sampling operation itself depends on \(\phi\) but isn't differentiable.
The Solution: Separate Randomness from Parameters¤
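Instead of sampling \(z\) directly from \(q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))\), reparameterize as:
\[z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})\]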
where \(\odot\) denotes element-wise multiplication.
graph TD
A["Input x"] --> B["Encoder"]
B --> C["μ (mean)"]
B --> D["σ (std dev)"]
E["ε ~ 𝒩(0,I)<br/>(random noise)"] --> F["z = μ + σ ⊙ ε"]
C --> F
D --> F
F --> G["Decoder"]
G --> H["Reconstruction x̂"]
style F fill:#ffebee
style E fill:#e1f5ff
Why this works:
- The randomness (\(\epsilon\)) is now independent of our parameters \(\phi\)
- Gradients flow through the deterministic operations \(\mu_\phi\) and \(\sigma_\phi\)
- The expectation becomes \(\mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon, x))]\) where \(g_\phi\) is deterministic
- We can approximate this expectation with Monte Carlo sampling: sample \(\epsilon\), compute gradients, average
This idea — proposed simultaneously by Kingma & Welling, 2014 and Rezende et al., 2014 — enabled practical VAE training and has since become fundamental to probabilistic deep learning.
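To see the trick at work, the sketch below differentiates a Monte Carlo estimate of \(\mathbb{E}[z^2]\) for a scalar Gaussian, where the exact answer is known: \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\), so the gradients are \(2\mu\) and \(2\sigma\). The test function \(f(z) = z^2\), the parameter values and the sample count are assumptions picked for illustration.

```python
import jax
import jax.numpy as jnp

def mc_objective(params, eps):
    # z = mu + sigma * eps, with eps drawn independently of the parameters
    mu, sigma = params
    z = mu + sigma * eps
    return jnp.mean(z ** 2)       # f(z) = z^2, averaged over the noise samples

mu, sigma = 1.5, 0.5
eps = jax.random.normal(jax.random.PRNGKey(0), (100_000,))
grad_mu, grad_sigma = jax.grad(mc_objective)((mu, sigma), eps)

# Analytic values for comparison: d/dmu = 2 * mu, d/dsigma = 2 * sigma
print(float(grad_mu), 2 * mu)
print(float(grad_sigma), 2 * sigma)
```

Because `eps` is passed in as data, `jax.grad` only ever differentiates the deterministic map from `(mu, sigma)` to `z`.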
Loss Function and Training¤
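The VAE loss is derived directly from the negative ELBO:
\[\mathcal{L}_{\text{VAE}}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + D_{\text{KL}}(q_\phi(z|x) \| p(z))\]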
Practical Implementation¤
For Gaussian encoder \(q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2\mathbf{I})\) and standard normal prior \(p(z) = \mathcal{N}(0, \mathbf{I})\):
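Reconstruction Loss (assuming Gaussian decoder with fixed variance; equal to the negative log-likelihood up to a scale factor and an additive constant):
\[\mathcal{L}_{\text{recon}} = \|x - \hat{x}\|_2^2, \qquad \hat{x} = \mu_\theta(z)\]
KL Divergence (closed-form for Gaussians; see Kingma & Welling, 2014, Appendix B):
\[D_{\text{KL}}(q_\phi(z|x) \| p(z)) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)\]
Total Loss:
\[\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{recon}} + D_{\text{KL}}(q_\phi(z|x) \| p(z))\]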
Training Algorithm¤
import jax
import jax.numpy as jnp
from flax import nnx
import optax

# Assumed to exist already: `vae` (an nnx.Module with `encoder` and `decoder`),
# `dataloader` (yields batches of shape [B, ...]) and `num_epochs`.

# Create optimizer (wrt=nnx.Param required in NNX 0.11.0+)
optimizer = nnx.Optimizer(vae, optax.adam(1e-3), wrt=nnx.Param)
rng_key = jax.random.PRNGKey(0)

def loss_fn(vae, batch, key):
    # Forward pass
    mu, log_var = vae.encoder(batch)
    # Reparameterization trick
    epsilon = jax.random.normal(key, mu.shape)
    z = mu + jnp.exp(0.5 * log_var) * epsilon
    # Decode
    x_recon = vae.decoder(z)
    # ELBO terms: use a *consistent reduction*: sum over non-batch
    # axes (pixels for recon, latent dims for KL), then mean over the
    # batch axis. Mixing jnp.mean for recon with jnp.sum for KL is the
    # single most common bug in VAE code: it leaves the two terms on
    # wildly different scales and the KL silently dominates.
    recon_loss = jnp.mean(
        jnp.sum((x_recon - batch) ** 2, axis=tuple(range(1, batch.ndim)))
    )
    kl_per_sample = -0.5 * jnp.sum(
        1 + log_var - mu ** 2 - jnp.exp(log_var), axis=-1
    )
    kl_loss = jnp.mean(kl_per_sample)
    return recon_loss + kl_loss

for epoch in range(num_epochs):
    for batch in dataloader:
        # Split the key every step so each batch sees fresh noise
        rng_key, step_key = jax.random.split(rng_key)
        # Gradient update (NNX 0.11.0+ API)
        loss, grads = nnx.value_and_grad(loss_fn)(vae, batch, step_key)
        optimizer.update(vae, grads)
Key Training Metrics to Monitor¤
- Reconstruction loss: Should decrease steadily (lower = better reconstruction)
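- KL divergence: Should stabilize at a positive value (roughly 5-20 nats per sample is typical for well-trained models, though the exact figure depends on the dataset and the latent dimensionality)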
- ELBO: Combination of both, the primary metric
- Per-dimension KL: Helps detect posterior collapse (all values near 0 indicates problem)
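Per-dimension KL is cheap to log from the encoder outputs alone. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior; the function names and the 0.01-nat activity threshold are hypothetical choices, not established defaults.

```python
import jax.numpy as jnp

def kl_per_dimension(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)) for each latent dimension,
    # averaged over the batch axis. Returns shape [latent_dim].
    kl = -0.5 * (1.0 + log_var - mu ** 2 - jnp.exp(log_var))
    return jnp.mean(kl, axis=0)

def active_dimensions(mu, log_var, threshold=0.01):
    # `threshold` (in nats) is a hypothetical cut-off; tune it for your model.
    return kl_per_dimension(mu, log_var) > threshold

# Dummy encoder outputs: dimension 0 is informative, dimension 1 has collapsed
mu = jnp.array([[1.0, 0.0], [-1.0, 0.0]])
log_var = jnp.array([[-2.0, 0.0], [-2.0, 0.0]])
print(kl_per_dimension(mu, log_var))
print(active_dimensions(mu, log_var))
```

A dimension whose posterior equals the prior for every input (`mu = 0`, `log_var = 0`) contributes exactly zero KL, which is what a collapsed dimension looks like in this metric.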
VAE Variants¤
β-VAE: Disentangled Representations¤
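β-VAE (Higgins et al., 2017) modifies the objective to encourage disentanglement, where individual latent dimensions capture independent factors of variation:
\[\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{\text{KL}}(q_\phi(z|x) \| p(z))\]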
Effect of β:
- β = 1: Standard VAE (no additional emphasis on disentanglement)
- β > 1: Stronger regularization → encourages independent latent dimensions, improves disentanglement, but reduces reconstruction quality
- β < 1: Weaker regularization → better reconstruction, less structured latent space
graph LR
subgraph "β < 1: Reconstruction Focus"
A1[Sharp Images] --> B1[Entangled Latents]
end
subgraph "β = 1: Standard VAE"
A2[Balanced] --> B2[Some Structure]
end
subgraph "β > 1: Disentanglement Focus"
A3[Blurrier Images] --> B3[Disentangled Latents]
end
style A1 fill:#c8e6c9
style A3 fill:#ffccbc
style B3 fill:#c8e6c9
Practical β values: Start with β=1, try β=4-10 for image disentanglement tasks (dSprites, CelebA), use β=0.1-0.5 for text (to avoid posterior collapse). Burgess et al. (2018) further analyse β-VAE through an information-theoretic capacity-control lens and propose annealing the KL capacity \(C\) rather than \(β\) itself.
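In code, both variants are small changes to how the KL term enters the loss. The sketch below follows the capacity-controlled form \(\gamma\,|D_{\text{KL}} - C|\) from Burgess et al. (2018); the function names and the default values for `beta`, `c_max`, `anneal_steps` and `gamma` are illustrative assumptions and should be tuned per dataset.

```python
import jax.numpy as jnp

def beta_vae_loss(recon_loss, kl, beta=4.0):
    # Negative beta-weighted ELBO: reconstruction error plus beta times the KL.
    return recon_loss + beta * kl

def capacity_vae_loss(recon_loss, kl, step, c_max=25.0,
                      anneal_steps=100_000, gamma=1000.0):
    # Capacity-controlled objective in the spirit of Burgess et al. (2018):
    # the target capacity C grows linearly from 0 to c_max, and the KL is
    # penalised for deviating from C in either direction.
    capacity = c_max * jnp.minimum(1.0, step / anneal_steps)
    return recon_loss + gamma * jnp.abs(kl - capacity)

print(float(beta_vae_loss(10.0, 2.0, beta=4.0)))
print(float(capacity_vae_loss(10.0, 12.5, step=50_000)))
```

With the capacity form, the KL is pulled toward a moving target rather than toward zero, so the latent code is allowed to carry more information as training progresses.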
Applications:
- Interpretable representations for analysis and visualization
- Fair AI by removing sensitive attributes from representations
- Controllable generation by manipulating specific latent factors
β-TCVAE and FactorVAE: Targeting Total Correlation¤
β-VAE penalises the full KL term, which conflates two effects: pushing each posterior toward the prior, and pushing the aggregated posterior \(q_\phi(z)=\mathbb{E}_{p_d(x)}q_\phi(z\mid x)\) toward a factorized distribution. Only the second effect actually drives disentanglement. Two variants isolate it:
- FactorVAE (Kim & Mnih, 2018) trains a discriminator to estimate the total correlation \(\mathrm{TC}(z) = D_{\mathrm{KL}}(q_\phi(z)\,\|\,\prod_j q_\phi(z_j))\) and adds a \(\gamma\cdot\mathrm{TC}\) penalty on top of the standard ELBO.
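- β-TCVAE (Chen et al., 2018) decomposes the KL term in closed form into mutual information, total correlation, and dimension-wise KL, and re-weights only the TC component:
\[\mathcal{L}_{\beta\text{-TC}} = \mathbb{E}_{p_d(x)}\,\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - I_q(x; z) - \beta\,\mathrm{TC}(z) - \sum_j D_{\mathrm{KL}}(q_\phi(z_j) \| p(z_j))\]
β-TCVAE is reported to reach a better trade-off between disentanglement and reconstruction than β-VAE, and the same paper introduces the Mutual Information Gap (MIG) disentanglement metric.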
Conditional VAE (CVAE)¤
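Conditional VAEs (Sohn et al., 2015) incorporate additional information \(y\) (class labels, attributes, text descriptions) to enable controlled generation:
\[\log p_\theta(x|y) \geq \mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(x|z,y)] - D_{\text{KL}}(q_\phi(z|x,y) \| p_\theta(z|y))\]
In practice the conditional prior \(p_\theta(z|y)\) is often simplified to the unconditional \(p(z)\).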
graph TD
A["Input x"] --> E["Encoder"]
B["Condition y<br/>(e.g., class label)"] --> E
E --> C["Latent z"]
C --> D["Decoder"]
B --> D
D --> F["Reconstruction x̂"]
style B fill:#fff9c4
style E fill:#f3e5f5
style D fill:#fff3e0
How conditioning works:
- Concatenation: Append \(y\) to the input before encoding and to \(z\) before decoding — simple but less expressive than feature-wise modulation.
- Conditional Batch Normalization (De Vries et al., 2017; Dumoulin et al., 2017): Replace BN's affine parameters with \(y\)-dependent values.
- FiLM (Feature-wise Linear Modulation) (Perez et al., 2018): Scale and shift each feature channel by \(y\)-dependent \((\gamma, \beta)\). Used pervasively in modern conditional generators.
- Cross-attention: For sequence/text conditions, attend from feature maps onto an encoded \(y\) — the technique that powers Stable Diffusion's text-to-image conditioning.
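The first and third mechanisms are short enough to sketch directly. In the code below, the tensor shapes, the function names and the choice to offset the FiLM scale by 1 (so that zero weights leave features unchanged) are assumptions made for illustration.

```python
import jax
import jax.numpy as jnp

def concat_condition(z, y_onehot):
    # Concatenation: append the condition to the latent code before decoding.
    return jnp.concatenate([z, y_onehot], axis=-1)

def film(features, y_embed, w_gamma, w_beta):
    # FiLM: per-channel scale and shift predicted from the condition.
    # features: [B, H, W, C]; y_embed: [B, E]; w_gamma, w_beta: [E, C]
    gamma = 1.0 + y_embed @ w_gamma    # offset by 1 so zero weights give identity
    beta = y_embed @ w_beta
    return features * gamma[:, None, None, :] + beta[:, None, None, :]

key_f, key_y = jax.random.split(jax.random.PRNGKey(0))
features = jax.random.normal(key_f, (2, 8, 8, 16))
y_onehot = jax.nn.one_hot(jnp.array([3, 7]), num_classes=10)
z = jnp.zeros((2, 32))
decoder_input = concat_condition(z, y_onehot)                  # shape [2, 42]
modulated = film(features, y_onehot, jnp.zeros((10, 16)), jnp.zeros((10, 16)))
```

Concatenation injects \(y\) once at the input, whereas FiLM can be applied at every layer, which is one reason it is considered more expressive.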
Applications:
- Class-conditional generation: Generate specific digit classes in MNIST or specific ImageNet classes (Mirza & Osindero, 2014, introduced the analogous conditional GAN).
- Attribute manipulation: Change hair color, age, expression in face images.
- Text-to-image: Generate images matching text descriptions, as in DALL·E 1 (Ramesh et al., 2021) and the broader latent-diffusion family.
Vector Quantized VAE (VQ-VAE)¤
VQ-VAE (van den Oord et al., 2017) replaces continuous latent representations with discrete codes from a learned codebook:
\[z_q(x) = \arg\min_{e_k \in \mathcal{C}} \|z_e(x) - e_k\|_2\]
where \(\mathcal{C} = \{e_1, ..., e_K\}\) is a learned codebook of \(K\) embedding vectors.
graph TD
A["Input x"] --> B["Encoder"]
B --> C["Continuous z_e"]
C --> D["Vector<br/>Quantization"]
E["Learned<br/>Codebook"] --> D
D --> F["Discrete z_q"]
F --> G["Decoder"]
G --> H["Reconstruction x̂"]
style D fill:#ffebee
style E fill:#e1f5ff
VQ-VAE Loss Function:
\[\mathcal{L} = \|x - \hat{x}\|_2^2 + \|sg[z_e(x)] - e_k\|_2^2 + \beta\,\|z_e(x) - sg[e_k]\|_2^2\]
where \(sg[\cdot]\) is the stop-gradient operator. The three terms are:
- Reconstruction loss: standard pixel-wise error.
- Codebook loss: pulls each codebook entry \(e_k\) toward the encoder outputs that selected it (gradient flows only into the codebook). The original paper trains with this gradient-based update and describes an exponential-moving-average (EMA) alternative in its appendix; follow-ups such as VQ-VAE-2 (Razavi et al., 2019) adopt the EMA update for better stability.
- Commitment loss: pulls the encoder output toward its assigned codebook entry, with weight \(\beta\) (typically 0.25); without it the encoder output can drift arbitrarily.
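The quantisation step and the two auxiliary losses can be sketched in a few lines of JAX. The code below also includes the straight-through gradient copy that the original paper uses to pass decoder gradients back to the encoder; the function names, the tiny 2-D codebook and the sample encoder outputs are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def quantize(z_e, codebook):
    # z_e: [N, D] encoder outputs; codebook: [K, D] embedding vectors.
    # Squared Euclidean distance from every encoder output to every code.
    dists = jnp.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = jnp.argmin(dists, axis=-1)
    return codebook[indices], indices

def vq_losses(z_e, codebook, beta=0.25):
    z_q, indices = quantize(z_e, codebook)
    sg = jax.lax.stop_gradient
    # Codebook loss: gradient reaches only the codebook entries.
    codebook_loss = jnp.mean(jnp.sum((sg(z_e) - z_q) ** 2, axis=-1))
    # Commitment loss: gradient reaches only the encoder output.
    commitment_loss = beta * jnp.mean(jnp.sum((z_e - sg(z_q)) ** 2, axis=-1))
    # Straight-through estimator: forward pass uses z_q, backward pass copies
    # the decoder's gradient onto z_e as if quantisation were the identity.
    z_q_st = z_e + sg(z_q - z_e)
    return z_q_st, codebook_loss, commitment_loss, indices

codebook = jnp.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z_e = jnp.array([[0.9, 1.2], [3.5, 3.9]])
z_q_st, codebook_loss, commitment_loss, indices = vq_losses(z_e, codebook)
print(indices, float(codebook_loss), float(commitment_loss))
```

The decoder would consume `z_q_st`; its value equals the selected codebook vectors, but its gradient with respect to `z_e` is the identity.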
Key advantages:
- ✅ No continuous posterior collapse — discrete codes cannot collapse to a degenerate Gaussian, but they have their own failure mode (codebook collapse: only a few entries get used). Tricks like EMA updates, codebook reset of dead entries, \(\ell_2\)-normalised codes (Yu et al., 2022 — ViT-VQGAN), and finite scalar quantisation (Mentzer et al., 2024 — FSQ) materially reduce it.
- ✅ Sharp reconstructions — the discreteness side-steps the Gaussian-decoder blur.
- ✅ Foundation for two-stage generation — pair the VQ-VAE with a powerful autoregressive or diffusion prior over the discrete codes (DALL·E 1, Parti, MaskGIT, MAGVIT-v2).
Applications:
- DALL·E 1 (Ramesh et al., 2021) — text-to-image with a transformer over VQ-VAE codes. (DALL·E 2 / 3 use diffusion instead, not VQ-VAE.)
- Jukebox (Dhariwal et al., 2020) — hierarchical VQ-VAE for raw-audio music generation.
- VQ-GAN (Esser et al., 2021) — adds adversarial + perceptual losses to VQ-VAE; the autoencoder template that became Stable Diffusion's tokenizer.
- Speech / audio tokenizers — SoundStream, EnCodec, and the residual VQ family use the same recipe for neural codecs.
Training Dynamics and Common Challenges¤
Posterior Collapse¤
What is it?
The encoder learns to ignore the input, producing latent codes that are essentially identical to the prior \(q_\phi(z|x) \approx p(z)\). The decoder learns to generate data without using latent information, defeating the purpose of the model.
How to detect:
- KL divergence ≈ 0 across all dimensions
- Random samples from prior produce diverse outputs, but encoding-decoding produces generic/blurry results
- Reconstructions don't match inputs well even when the (teacher-forced) reconstruction loss is low
Why does it happen?
Powerful autoregressive decoders (especially in text VAEs) can model \(p(x)\) without needing latent information. The KL term drives encodings toward the prior, and if the decoder doesn't need \(z\), the KL term wins.
graph TD
A[Strong Decoder] --> B{Can generate<br/>without z?}
B -->|Yes| C[Ignores Latent Code]
B -->|No| D[Uses Latent Code]
C --> E[Posterior Collapse]
D --> F[Healthy Training]
G[KL Annealing] --> D
H[Weak Decoder] --> D
I[Free Bits] --> D
style E fill:#ffccbc
style F fill:#c8e6c9
Solutions ranked by effectiveness:
- KL Annealing (CRITICAL for text): Start with β=0 and ramp up so the encoder produces useful codes before the KL penalty starts pushing them toward the prior.
    - Linear (Bowman et al., 2015): β = min(1.0, step / warmup_steps). The original VAE-text recipe; warmup of 10–40 % of training is typical.
    - Cyclical (Fu et al., 2019): repeat the 0→1 ramp \(M\) times. Each cycle warm-restarts from the previous solution and is reported to improve NLP benchmarks vs. a single linear schedule.
    - Sigmoid / monotonic schedules are also common; the choice matters less than ensuring β stays below 1 long enough to learn informative codes.
- Free Bits (Kingma et al., 2016): only penalise KL above a per-group threshold \(\lambda\) (typically 0.5–2.0 nats). Below the threshold the penalty is constant, so the optimiser has no incentive to push a group's KL under λ nats and each group is encouraged to keep carrying information about \(x\):
\[\widetilde{\mathcal{L}}_{\mathrm{KL}} = \sum_g \max\!\big(\lambda,\ D_{\mathrm{KL}}(q_\phi(z_g\mid x)\|p(z_g))\big)\]
- δ-VAE (Razavi et al., 2019): constrain the prior and posterior families (e.g., an AR(1) prior over the latent sequence) so that the KL has a strict lower bound δ > 0, eliminating the collapsed solution from the optimisation landscape entirely.
- β-VAE with β < 1: Reduce KL penalty (β=0.1-0.5 for text). Simple but blurs the structured-latent guarantees of β=1.
- Word Dropout (for text): Randomly replace 25-50% of the words fed to the decoder with <UNK> so the decoder cannot rely on autoregressive shortcuts.
- Weakening the Decoder: Use a less expressive decoder (smaller LSTM, dilated CNN over autoregressive transformer) so it cannot ignore \(z\).
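The annealing schedules and the free-bits penalty from the list above can be sketched as small pure functions. The names, the `ramp_fraction` default and the `free_bits` default are illustrative assumptions.

```python
import jax.numpy as jnp

def linear_beta(step, warmup_steps):
    # Ramp from 0 to 1 over `warmup_steps`, then stay at 1.
    return jnp.minimum(1.0, step / warmup_steps)

def cyclical_beta(step, cycle_steps, ramp_fraction=0.5):
    # Restart the ramp every `cycle_steps`; within a cycle, ramp during the
    # first `ramp_fraction` of the steps and hold beta = 1 for the remainder.
    position = (step % cycle_steps) / cycle_steps
    return jnp.minimum(1.0, position / ramp_fraction)

def free_bits_kl(kl_per_group, free_bits=0.5):
    # Each group contributes at least `free_bits` nats, so there is no gradient
    # pushing a group's KL below that floor.
    return jnp.sum(jnp.maximum(free_bits, kl_per_group), axis=-1)

print([float(linear_beta(s, 100)) for s in (0, 50, 250)])
print([float(cyclical_beta(s, 1000)) for s in (0, 250, 750, 1250)])
print(float(free_bits_kl(jnp.array([0.1, 2.0]))))
```

The training loss would then be `recon_loss + beta * kl_loss`, with `beta` taken from one of the schedules, or `recon_loss + free_bits_kl(...)` when using free bits.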
Blurry Reconstructions¤
Why it happens:
MSE loss encourages the decoder to output \(\mathbb{E}[x|z]\), the average of all plausible outputs. Averaging sharp images produces blur—this is a fundamental consequence of the Gaussian likelihood assumption, not a bug.
Solutions:
- Perceptual Loss (e.g., LPIPS; Zhang et al., 2018): Replace pixel-wise MSE with VGG/AlexNet feature matching.
    - Significantly improves sharpness while maintaining structure.
    - Deep-feature losses of this kind are used in DFC-VAE (Hou et al., 2017), which predates LPIPS, and inside the SD-VAE training recipe.
- Adversarial Training: Add a discriminator to penalise unrealistic outputs.
    - VAE-GAN (Larsen et al., 2016) is the canonical recipe.
    - The Stable Diffusion VAE adds a PatchGAN (Isola et al., 2017) discriminator on top of LPIPS + L1.
    - Combines reconstruction, KL, and adversarial losses.
- Multi-scale SSIM (Wang et al., 2003): Structural-similarity loss across multiple resolutions, better correlated with perceptual quality than MSE.
- VQ-VAE / VQGAN (van den Oord et al., 2017; Esser et al., 2021): Discrete latents and adversarial training together produce sharper outputs.
- Learned Variance: Let the decoder predict per-pixel variance \(\sigma^2_\theta(z)\) instead of using a fixed σ². This recovers a proper Gaussian likelihood and lets the model express its own uncertainty.
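For the learned-variance option, the reconstruction term becomes a full Gaussian negative log-likelihood. The sketch below assumes the decoder emits a mean and a log-variance per pixel; the function name and the toy numbers are hypothetical.

```python
import jax.numpy as jnp

def gaussian_nll(x, mean, log_var):
    # Per-sample negative log-likelihood of a diagonal Gaussian decoder.
    # With log_var fixed at 0 this is 0.5 * squared error plus a constant.
    per_pixel = 0.5 * (jnp.log(2.0 * jnp.pi) + log_var
                       + (x - mean) ** 2 / jnp.exp(log_var))
    return jnp.sum(per_pixel, axis=tuple(range(1, x.ndim)))

x = jnp.array([[0.0]])
mean = jnp.array([[2.0]])                            # a badly predicted pixel
nll_fixed = gaussian_nll(x, mean, jnp.zeros((1, 1)))             # sigma^2 = 1
nll_wide = gaussian_nll(x, mean, jnp.log(4.0) * jnp.ones((1, 1)))  # sigma^2 = 4
print(float(nll_fixed[0]), float(nll_wide[0]))
```

For this poorly predicted pixel, raising the predicted variance lowers the loss, which is how the model signals uncertainty instead of being forced toward a blurry average.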
Optimization Challenges¤
NaN losses:
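- Check activation functions: ensure Sigmoid on decoder output for [0,1] images
- Add gradient clipping: grads = jax.tree.map(lambda g: jnp.clip(g, -1.0, 1.0), grads)
- Bound the log-variance, e.g. log_var = jnp.clip(log_var_raw, -10.0, 10.0). Do not pass log_var itself through Softplus: that forces \(\log \sigma^2 > 0\), i.e. \(\sigma^2 > 1\). If you prefer Softplus, apply it to the standard deviation instead: sigma = nnx.softplus(sigma_raw) + 1e-6
- Reduce learning rate if gradients explode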
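An alternative to element-wise clipping is to rescale the whole gradient by its global norm, which preserves the gradient's direction. The sketch below implements that in plain JAX alongside a bounded log-variance helper; the function names, the `max_norm` value and the ±10 bounds are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient pytree so its joint L2 norm is at most max_norm.
    leaves = jax.tree_util.tree_leaves(grads)
    global_norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in leaves))
    scale = jnp.minimum(1.0, max_norm / (global_norm + 1e-12))
    return jax.tree_util.tree_map(lambda g: g * scale, grads), global_norm

def bounded_log_var(log_var_raw, low=-10.0, high=10.0):
    # Keep exp(log_var) inside a representable range; the bounds are illustrative.
    return jnp.clip(log_var_raw, low, high)

grads = {"w": jnp.array([3.0, 4.0]), "b": jnp.array([12.0])}
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(float(norm_before))                            # joint norm before clipping
print(bounded_log_var(jnp.array([-50.0, 0.0, 50.0])))
```

Logging `norm_before` at each step also gives the gradient-norm trace that is useful for diagnosing exploding gradients.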
Loss not decreasing:
- Verify loss signs: minimize negative ELBO
- Check data normalization: should be [0,1] or [-1,1]
- Ensure encoder-decoder dimension matching
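- Monitor gradient norms: as a rough heuristic they should stay in the range [0.1, 10]; persistently smaller or larger values suggest vanishing or exploding gradients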
Imbalanced loss terms:
- Reconstruction loss sums over many pixels; KL sums over few latent dimensions
- Solution: normalize by dimension count or manually weight with β
Advanced Topics¤
Hierarchical VAEs¤
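Stack multiple layers of latent variables for richer, more structured representations (with \(z_L\) the top level):
\[p_\theta(x, z_{1:L}) = p_\theta(x|z_1)\,\Big[\prod_{l=1}^{L-1} p_\theta(z_l|z_{l+1})\Big]\,p(z_L)\]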
The encoder is typically bidirectional — top-down conditional priors are combined with bottom-up posterior corrections, as in Ladder VAE (Sønderby et al., 2016) and IAF-VAE (Kingma et al., 2016).
Benefits:
- Coarse features (object class) at top levels
- Fine details (texture, color) at lower levels
- Better for complex, high-resolution data
State-of-the-art deep hierarchies:
- NVAE (Vahdat & Kautz, 2020) — 36 hierarchical groups, depthwise-separable convolutions, spectral regularisation; the first VAE to model 256×256 natural images.
- Very Deep VAE / VDVAE (Child, 2021) — pushes hierarchical depth to 70+ stochastic layers, beats PixelCNN log-likelihoods on CIFAR-10, ImageNet-32/64 and FFHQ-256 with far fewer parameters and orders-of-magnitude faster sampling.
- HQ-VAE (Takida et al., 2024) — generalises hierarchical discrete latent codebooks (the VQ side of the family) under a single variational Bayes objective, mitigating the codebook-collapse failure mode of stacked VQ-VAE.
Importance Weighted VAE (IWAE)¤
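Burda et al. (2016) use multiple samples to get tighter bounds on the log-likelihood:
\[\mathcal{L}_K = \mathbb{E}_{z_1, \ldots, z_K \sim q_\phi(z|x)}\left[\log \frac{1}{K}\sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}\right]\]
With \(K\) samples, the IWAE bound is at least as tight as the standard ELBO (K=1) and never loosens as \(K\) grows: \(\mathcal{L}_K \leq \mathcal{L}_{K+1} \leq \log p_\theta(x)\). Typical values: K=5-50.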
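The effect of \(K\) can be checked numerically in a toy model whose evidence is known exactly. The sketch below assumes a prior \(\mathcal{N}(0,1)\), a likelihood \(\mathcal{N}(z,1)\) and a deliberately imperfect posterior \(\mathcal{N}(m, s^2)\); the model, the parameter values and the function names are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def log_normal(x, mean, var):
    return -0.5 * (jnp.log(2.0 * jnp.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(key, x, m, s2, num_samples, num_repeats=20_000):
    # Toy model (assumption): p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(m, s2).
    eps = jax.random.normal(key, (num_repeats, num_samples))
    z = m + jnp.sqrt(s2) * eps
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, m, s2)
    # log (1/K) sum_k w_k, computed stably, then averaged over the repeats
    return jnp.mean(logsumexp(log_w, axis=1) - jnp.log(num_samples))

key = jax.random.PRNGKey(0)
x, m, s2 = 1.0, 0.3, 0.8
bound_k1 = iwae_bound(key, x, m, s2, num_samples=1)     # standard ELBO estimate
bound_k50 = iwae_bound(key, x, m, s2, num_samples=50)
log_px = log_normal(x, 0.0, 2.0)                        # exact evidence, N(0, 2)
print(float(bound_k1), float(bound_k50), float(log_px))
```

The `logsumexp` call matters in real models, where the importance weights can span many orders of magnitude.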
Normalizing Flow VAE¤
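Rezende & Mohamed (2015) replace the Gaussian posterior with flexible distributions via invertible transformations:
\[z_K = f_K \circ \cdots \circ f_1(z_0), \qquad z_0 \sim q_\phi(z_0|x), \qquad \log q_K(z_K|x) = \log q_\phi(z_0|x) - \sum_{k=1}^{K} \log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|\]
where each \(f_k\) is an invertible function. Common flow families: Real NVP (Dinh et al., 2017), MAF (Papamakarios et al., 2017), IAF (Kingma et al., 2016), which is well-suited as a posterior because sampling is fast, and Glow (Kingma & Dhariwal, 2018).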
Benefits:
- Arbitrarily complex posterior distributions
- Better approximation of true posterior, tightening the ELBO
- Improved generation quality
Trade-off: Increased compute during training; some flows (MAF) are slow to sample, others (IAF) are slow to evaluate density.
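As a concrete instance, the sketch below implements one planar flow step, the simplest flow in Rezende & Mohamed (2015), and cross-checks its analytic log-determinant against the autodiff Jacobian. The parameter values are arbitrary illustrations, chosen so that \(w^\top u \geq -1\), which keeps the tanh-based map invertible.

```python
import jax
import jax.numpy as jnp

def planar_flow(z, u, w, b):
    # f(z) = z + u * tanh(w.z + b); invertible when w.u >= -1.
    activation = jnp.tanh(jnp.dot(w, z) + b)
    z_new = z + u * activation
    # Matrix determinant lemma: det(I + u psi^T) = 1 + u.psi
    psi = (1.0 - activation ** 2) * w
    log_det = jnp.log(jnp.abs(1.0 + jnp.dot(u, psi)))
    return z_new, log_det

z0 = jnp.array([0.5, -1.0, 0.2])
u = jnp.array([0.3, 0.1, -0.2])
w = jnp.array([1.0, -0.5, 0.4])
b = 0.1
z1, log_det = planar_flow(z0, u, w, b)

# Cross-check against the determinant of the autodiff Jacobian
jacobian = jax.jacfwd(lambda z: planar_flow(z, u, w, b)[0])(z0)
_, log_det_autodiff = jnp.linalg.slogdet(jacobian)
print(float(log_det), float(log_det_autodiff))
```

The posterior log-density after the step is the base log-density at `z0` minus `log_det`; stacking several steps simply accumulates these corrections.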
VampPrior and Learned Priors¤
The standard \(\mathcal{N}(0, I)\) prior is mismatched with the aggregated posterior — most regions of the prior have low decoder mass, hurting unconditional samples. VampPrior (Tomczak & Welling, 2017) replaces \(p(z)\) with a mixture \(p_\lambda(z) = \tfrac{1}{K}\sum_{k} q_\phi(z\mid u_k)\) parameterised by \(K\) learned pseudo-inputs \(u_k\), sharply tightening the ELBO and improving sample quality at modest cost. Diffusion priors (Wehenkel & Louppe, 2021; Vahdat et al., 2021 — LSGM) generalise this further by training a diffusion model in the latent space.
VAEs Inside Modern Generative Pipelines¤
In 2026, the most economically important use of VAEs is no longer end-to-end generation — it is perceptual compression for diffusion, flow, and autoregressive generators. Stable Diffusion (Rombach et al., 2022), SDXL (Podell et al., 2024), SD3, FLUX.1, Wan, Cosmos (NVIDIA, 2025), and many open video stacks share the same broad recipe: an autoencoder / VAE-like codec compresses pixels into a low-dimensional latent, and a diffusion, flow-matching, or autoregressive prior is trained in that latent space. Sora's public report confirms compressed spacetime patch latents but does not specify a public VAE architecture.
Latent Diffusion Models (LDMs)¤
graph LR
A[Pixel x] --> B[VAE Encoder]
B --> C[Latent z]
C --> D[Diffusion in z-space]
D --> E[Denoised z']
E --> F[VAE Decoder]
F --> G[Pixel x']
style B fill:#f3e5f5
style F fill:#fff3e0
style D fill:#e1f5ff
The VAE here is typically a KL-regularised autoencoder (the "KL-f8" SD-VAE: 8× spatial downsampling, 4 channels, mild KL term) trained with the recipe introduced by Rombach et al. (2022): an L1 reconstruction term, an LPIPS perceptual loss, and a PatchGAN adversarial loss — the loss design is inherited from VQGAN (Esser et al., 2021) but the discrete codebook is replaced by a continuous Gaussian latent with a small KL penalty (≈10⁻⁶). The KL term keeps the latent close to a unit-Gaussian prior shape; what changes versus a classical VAE is that the generative prior is now learned — a diffusion / flow model in \(z\) — rather than fixed standard-normal. Newer image codecs often use wider latent channel counts than the original SD-VAE, and open video systems extend the idea to 3-D causal-convolutional video autoencoders.
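The compression implied by the KL-f8 configuration is easy to work out. The helper names below are hypothetical; the defaults (8× downsampling, 4 latent channels, RGB input) come from the description above.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    # Spatial dims shrink by the downsampling factor; channels are set by the codec.
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, image_channels=3,
                      downsample=8, latent_channels=4):
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * image_channels) / (h * w * c)

print(latent_shape(512, 512))        # (64, 64, 4)
print(compression_ratio(512, 512))   # 48.0
```

A 512×512 RGB image therefore becomes a 64×64×4 latent, so the diffusion model operates on 48 times fewer values than it would in pixel space.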
Recent Tokenizer-VAE Advances (2024–2026)¤
| Model | Year | Contribution |
|---|---|---|
| LiteVAE (Sadat et al., 2024) | 2024 | 2-D discrete wavelet transform inside the encoder; ~6× fewer parameters than SD-VAE at matching rFID. |
| CV-VAE (Zhao et al., 2024) | 2024 | Video VAE whose latent space is compatible with that of a frozen image VAE, so video diffusion can be trained on top of an image-pretrained UNet without re-tuning. |
| WF-VAE (Li et al., 2024) | 2024 | Multi-level wavelet flow that routes low-frequency energy directly into the latent; 2× throughput, 4× lower memory than prior video VAEs. |
| IV-VAE (Wu et al., 2024) | 2024 | Keyframe temporal compression + group-causal 3-D convolutions for higher-compression video latents. |
| EQ-VAE (Kouzelis et al., 2025) | 2025 | Equivariance regularisation on the latent under semantic-preserving transforms (scale, rotation); 7× faster DiT-XL training when used as a fine-tune of SD-VAE. |
| MAETok (Chen et al., 2025) | 2025 | Drops the variational constraint entirely and trains the tokenizer as a Masked Autoencoder; argues that latent structure, not the KL term, is what diffusion priors actually need. |
| DiTo (Chen et al., 2025) | 2025 | Diffusion autoencoders (encoder → quantised latent → diffusion decoder) that scale as image tokenizers. |
| VTP (arXiv:2512.13687, 2025) | 2025 | Decoupled pre-training of visual tokenizers from the diffusion prior; 1.11 gFID on ImageNet 256². |
| PH-VAE (arXiv:2603.01800, 2026) | 2026 | Phase-Type decoders (continuous-time Markov absorption times) model heavy-tailed data substantially better than Gaussian / Student-t / GEV decoders. |
| VFM-as-Tokenizer (arXiv:2510.18457, 2025) | 2025 | Use a frozen Vision Foundation Model (DINOv2, SigLIP) directly as the encoder — the discriminative latent geometry transfers cleanly to diffusion priors. |
| Latent Diffusion without a VAE (arXiv:2510.15301, 2025/26) | 2025/26 | Argues that VAE-induced latents are bottlenecks for diffusion training; explores VAE-free LDMs operating on patch tokens or VFM features. |
The throughline of this generation: the VAE used to be the generator; now it's the codec. Reconstruction fidelity (rFID, PSNR, LPIPS) and the geometry of the latent (smoothness, equivariance, semantic separability) matter much more than the absolute ELBO.
Diffusion–VAE Hybrids¤
Beyond the LDM "VAE-then-diffusion" pipeline, several hybrids tie the two more tightly:
- DiffuseVAE (Pandey et al., 2022) uses a VAE to produce a coarse reconstruction, then diffuses the residual between that reconstruction and the data — fast like a VAE but sharp like a diffusion model.
- LSGM (Vahdat et al., 2021) trains a diffusion model as the prior of an NVAE, jointly maximising a single ELBO.
- DVAE / Diffusion Decoders (Preechakul et al., 2022 — DiffAE) replace the Gaussian decoder with a conditional diffusion model, recovering near-GAN sample fidelity while keeping a usable encoder.
- Generalisation theory (Wang et al., 2025) gives a unified information-theoretic analysis of when VAE encoders + diffusion generators generalise, formalising what the empirical LDM literature has been observing.
Foundation-Model Tokenizers¤
A complementary 2025–2026 direction abandons the trained encoder altogether and uses a frozen Vision Foundation Model (DINOv2, SigLIP, MAE) as the encoder, training only a lightweight decoder (arXiv:2510.18457, 2025; arXiv:2509.25162, 2025). This trades end-to-end optimality for a far better-conditioned latent geometry, and converges with the LiteVAE / EQ-VAE story: the latent's structure, not the KL term, is what downstream priors actually consume.
Latent Space Properties and Interpretation¤
Continuity and Interpolation¤
A well-trained VAE has a continuous latent space where:
- Nearby points decode to similar outputs
- Linear interpolation produces smooth transitions
- The space is "covered" - no holes where sampling produces garbage
Testing interpolation:
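import jax.numpy as jnp

# Assumes `encoder(x)` returns (mu, log_var) and `decoder(z)` maps latents to data
# Encode two images and keep only the posterior means
z1, _ = encoder(x1)
z2, _ = encoder(x2)
# Linearly interpolate between the two latent codes
alphas = jnp.linspace(0.0, 1.0, num=10)
z_interp = [(1 - alpha) * z1 + alpha * z2 for alpha in alphas]
# Decode each interpolated point
x_interp = [decoder(z) for z in z_interp]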
Disentanglement: Independent Factors of Variation¤
In a disentangled representation, each latent dimension captures a single, interpretable factor:
- \(z_1\): Object class (digit identity)
- \(z_2\): Rotation angle
- \(z_3\): Stroke width
- \(z_4\): Position
- ...
graph TD
subgraph "Disentangled Latent Space"
A["z₁: Rotation"] --> E["Decoder"]
B["z₂: Size"] --> E
C["z₃: Color"] --> E
D["z₄: Position"] --> E
end
E --> F["Generated Image"]
subgraph "Entangled Latent Space"
G["z₁: Mixed<br/>(rotation + size)"] --> H["Decoder"]
I["z₂: Mixed<br/>(color + position)"] --> H
end
H --> J["Generated Image"]
style E fill:#c8e6c9
style H fill:#ffccbc
Achieving disentanglement:
- Train with β-VAE (β > 1) for a quick lever; expect a recon-vs-disentanglement trade-off.
- Use β-TCVAE or FactorVAE when you want to reweight only the total-correlation component of the KL.
- Use structured datasets (dSprites, 3D Shapes, MPI3D) — disentanglement is essentially impossible to evaluate on natural images.
- Add supervision or weak supervision (paired examples, attribute labels): Locatello et al., 2019 prove that purely unsupervised disentanglement is impossible without inductive biases on both the model and the data.
Measuring disentanglement:
- MIG (Mutual Information Gap) — gap between the top-2 latents most informative about each ground-truth factor; introduced with β-TCVAE.
- SAP (Separated Attribute Predictability) (Kumar et al., 2018) — gap between the top-2 latents that best predict each factor.
- DCI (Disentanglement, Completeness, Informativeness) (Eastwood & Williams, 2018) — three-metric framework based on a probe regressor's importance matrix.
- FactorVAE score (Kim & Mnih, 2018) — accuracy of a majority-vote classifier predicting which factor is held fixed in a batch.
Comparing VAEs with Other Generative Models¤
| Aspect | VAE | GAN | Diffusion | Normalizing Flow |
|---|---|---|---|---|
| Likelihood | Lower bound (ELBO) | Implicit | Lower bound (variational) | Exact |
| Training Stability | Stable | Unstable | Stable | Stable |
| Sample Quality | Good (blurry) | Excellent (sharp) | Excellent | Good |
| Sampling Speed | Fast | Fast | Slow (50-1000 steps) | Fast |
| Latent Space | Structured, smooth | None (no encoder) | Gradual diffusion | Exact bijection |
| Mode Coverage | Excellent | Poor (mode collapse) | Excellent | Excellent |
| Architecture Constraints | Flexible | Flexible | Flexible | Invertible only |
When to Use VAEs¤
VAEs Excel When:
- You need structured latent representations for downstream tasks
- Training stability is more important than peak image quality
- You want both generation and reconstruction capabilities
- Interpretability matters (anomaly detection, representation learning)
- You're working with non-image data (text, graphs, molecules)
Example Applications:
- Medical image anomaly detection via reconstruction error (Zimmerer et al., 2019)
- Molecular design with controllable chemical properties (Gómez-Bombarelli et al., 2018)
- Semi-supervised learning with limited labels (Kingma et al., 2014 — M2 model)
- Data compression and denoising (Ballé et al., 2018)
- Recommendation systems (Liang et al., 2018 — Mult-VAE)
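The reconstruction-error approach to anomaly detection can be sketched as a simple thresholding rule. This is a generic illustration, not the method of any cited paper: the helper names, the 99th-percentile choice, and the error values are all assumptions.

```python
def reconstruction_threshold(normal_errors, quantile=0.99):
    """Pick a threshold from reconstruction errors measured on normal held-out data."""
    ordered = sorted(normal_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def is_anomaly(error, threshold):
    """Flag an input whose reconstruction error exceeds the threshold."""
    return error > threshold

# Hypothetical per-sample reconstruction errors from a trained VAE on normal data.
normal_errors = [0.010 + 0.001 * i for i in range(100)]
threshold = reconstruction_threshold(normal_errors)

print(is_anomaly(0.5, threshold))   # far larger error than any normal sample
print(is_anomaly(0.02, threshold))  # well inside the normal range
```

In practice the score is often combined with the KL term or computed per pixel to localize the anomaly; the percentile is a tunable trade-off between false alarms and misses.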
When to Use GANs¤
GANs Excel When:
- Image quality is paramount (super-resolution, photorealistic faces)
- You don't need an encoder (generation-only tasks)
- You're willing to handle training instability
- Mode coverage isn't critical
Limitations:
- No built-in encoder: interpolation and latent arithmetic work between generated samples, but real data must first be embedded via GAN inversion
- Training instability (mode collapse, oscillation)
- No reconstruction capability
When to Use Diffusion Models¤
Diffusion Models Excel When:
- You want state-of-the-art quality — e.g. DALL·E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), Stable Diffusion (Rombach et al., 2022).
- Computational cost is acceptable.
- You need excellent mode coverage and diversity (Ho et al., 2020 — DDPM; Song et al., 2021 — score-based).
Limitations:
- Slow sampling (typically 20–1000 iterative steps).
- Higher inference cost than VAEs/GANs.
- Almost always paired with a VAE in practice — Latent Diffusion Models are the dominant deployment pattern, and the role of the VAE is exactly what was discussed in the VAEs Inside Modern Generative Pipelines section.
Practical Implementation Guide¤
Architecture Recommendations¤
For Images (MNIST, CIFAR-10, CelebA):
# Schematic layer stack, shown for 32×32 RGB inputs (CIFAR-10). With SAME padding,
# each stride-2 conv halves the spatial size: 32 → 16 → 8 → 4.
# Encoder (using Flax NNX)
nnx.Conv(3, 32, kernel_size=(4, 4), strides=2) → nnx.BatchNorm → nnx.relu
nnx.Conv(32, 64, kernel_size=(4, 4), strides=2) → nnx.BatchNorm → nnx.relu
nnx.Conv(64, 128, kernel_size=(4, 4), strides=2) → nnx.BatchNorm → nnx.relu
Flatten (128×4×4 = 2048) → nnx.Linear(2048, latent_dim × 2) → Split into μ and log(σ²)
# Decoder (mirror): 4 → 8 → 16 → 32
nnx.Linear(latent_dim, 128×4×4) → Reshape
nnx.ConvTranspose(128, 64, kernel_size=(4, 4), strides=2) → nnx.BatchNorm → nnx.relu
nnx.ConvTranspose(64, 32, kernel_size=(4, 4), strides=2) → nnx.BatchNorm → nnx.relu
nnx.ConvTranspose(32, 3, kernel_size=(4, 4), strides=2) → nnx.sigmoid
# For MNIST use 1 input/output channel and pad 28×28 to 32×32 (or crop the output);
# for 64×64 CelebA the flattened size becomes 128×8×8.
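To check that an encoder/decoder pair like the one above lines up for a given input size, the spatial sizes can be traced by hand. The sketch below assumes SAME padding and three stride-2 layers in each direction; the helper names are hypothetical and no Flax code is executed.

```python
def conv_same_out(size, stride=2):
    """Spatial size after a strided convolution with SAME padding (ceil division)."""
    return -(-size // stride)

def encoder_shapes(input_size, channels=(32, 64, 128)):
    """Return (final spatial size, flattened feature count) after the conv stack."""
    size = input_size
    for _ in channels:
        size = conv_same_out(size)
    return size, channels[-1] * size * size

def decoder_output_size(start_size, num_layers=3, stride=2):
    """Spatial size after `num_layers` SAME-padded transposed convolutions."""
    return start_size * stride ** num_layers

for name, side in [("MNIST", 28), ("CIFAR-10", 32), ("CelebA", 64)]:
    final, flat = encoder_shapes(side)
    print(f"{name}: {side} -> {final}x{final}, flatten={flat}, "
          f"decoder output={decoder_output_size(final)}")
```

Under these assumptions a 28×28 MNIST input is encoded to 4×4 but decoded to 32×32, so the input needs padding or the output needs cropping; 32×32 and 64×64 inputs round-trip exactly.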
For Text/Sequential Data:
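# Encoder (using Flax NNX; recurrent cells such as nnx.LSTMCell / nnx.GRUCell are wrapped by nnx.RNN)
nnx.Embed(vocab_size, embed_dim) → nnx.Bidirectional over forward/backward nnx.RNN layers (2-3 layers)
→ Concatenate final forward and backward hidden states → nnx.Linear(2 × hidden_dim, latent_dim × 2)
# Decoder
Repeat latent vector for each timestep
→ nnx.RNN(nnx.LSTMCell or nnx.GRUCell) → nnx.Linear(hidden_dim, vocab_size) → nnx.softmax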
Hyperparameter Recommendations¤
Latent Dimensions:
- MNIST (28×28): 2-20 dimensions
- CIFAR-10 (32×32): 128-256 dimensions
- CelebA (64×64): 256-512 dimensions
- Text (sentences): 32-128 dimensions
Learning Rates:
- Simple datasets (MNIST): 1e-3 to 5e-3
- Complex images: 1e-4 to 1e-3
- Text: 5e-4 to 1e-3
- Adam or AdamW is the standard optimizer choice
Batch Sizes:
- 64-128 works well across domains
- Larger batches improve gradient estimates but require more memory
Training Epochs:
- MNIST: 50-100 epochs
- CIFAR-10/CelebA: 100-300 epochs
- Text: 50-200 epochs
Essential Training Techniques¤
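- KL Annealing (critical for text, helpful for images):
# Linear annealing: ramp beta from 0 to 1 over the first 40 epochs
beta = min(1.0, epoch / 40)
loss = recon_loss + beta * kl_loss
# Cyclical annealing (Fu et al., 2019; a strong default for NLP):
# ramp beta up during the first half of each cycle, then hold it at 1.0
cycle_length = 10
t = epoch % cycle_length
if t <= 0.5 * cycle_length:
    beta = t / (0.5 * cycle_length)
else:
    beta = 1.0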
- Numerical Stability:
# Clip the raw log-variance to a safe range. log_var ∈ [-7, 7] keeps
# σ² ∈ [9e-4, 1.1e3] — wide enough for any realistic posterior, narrow
# enough that exp(log_var) cannot overflow during training.
log_var = jnp.clip(log_var_raw, -7.0, 7.0)
sigma = jnp.exp(0.5 * log_var)
# (Alternative) parameterise σ directly with softplus and recover log_var.
# Don't apply softplus to log_var itself — that forces σ² ≥ 1.
# sigma = nnx.softplus(sigma_raw) + 1e-6
# log_var = 2.0 * jnp.log(sigma)
# Global gradient-norm clip is more stable than element-wise clip.
grads = optax.clip_by_global_norm(1.0).update(grads, optax.EmptyState())[0]
- Loss Balancing — match reductions across terms:
# Reduce both terms the same way: sum over non-batch axes, mean over batch.
recon_per_sample = jnp.sum((x_recon - x) ** 2, axis=tuple(range(1, x.ndim)))
kl_per_sample = jnp.sum(kl_per_dim, axis=-1)
loss = jnp.mean(recon_per_sample) + beta * jnp.mean(kl_per_sample)
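The `kl_per_dim` term in the snippet above is not defined in this section. For the usual case of a diagonal-Gaussian posterior and a standard-normal prior (an assumption here), it has a closed form. The sketch uses NumPy so it runs standalone; `jnp` should be a drop-in replacement. The function names are illustrative.

```python
import numpy as np

def gaussian_kl_per_dim(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for each latent dimension, in nats."""
    return 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Sum over non-batch axes, mean over the batch, for both terms."""
    recon_per_sample = np.sum((x_recon - x) ** 2, axis=tuple(range(1, x.ndim)))
    kl_per_sample = np.sum(gaussian_kl_per_dim(mu, log_var), axis=-1)
    return np.mean(recon_per_sample) + beta * np.mean(kl_per_sample)

rng = np.random.default_rng(0)
x = rng.random((8, 4, 4, 1))   # toy batch of 8 single-channel 4x4 "images"
mu = np.zeros((8, 2))          # posterior equal to the prior ...
log_var = np.zeros((8, 2))
loss = vae_loss(x, x, mu, log_var)  # ... and a perfect reconstruction
print(loss)  # both terms vanish in this degenerate case
```

Because both terms are summed per sample before averaging, the relative weight of reconstruction and KL does not change with image resolution or latent size in unintended ways.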
Evaluation Metrics¤
VAEs are evaluated on three loosely-orthogonal axes: density-estimation quality, sample / reconstruction fidelity, and representation quality (since the encoder is half the model).
Density-Estimation Metrics¤
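- Negative ELBO: the training objective and an upper bound on \(-\log p_\theta(x)\); reported on image data as bits per dimension (\(\mathrm{BPD} = -\mathrm{ELBO}(x) / (D \ln 2)\) with the ELBO in nats, an upper bound on \(-\log_2 p_\theta(x) / D\)) and as perplexity on text.
- IWAE bound (Burda et al., 2016): tighter likelihood lower bound via \(K\) importance samples; standard for honest density-estimation comparisons.
- Active units: number of latent dimensions with \(D_\mathrm{KL}(q(z_j|x)\,\|\,p(z_j)) > 10^{-2}\) on average; diagnoses posterior collapse. (Burda et al., 2016 instead threshold the variance of the posterior mean across inputs, \(\mathrm{Var}_x(\mathbb{E}_{q}[z_j \mid x]) > 10^{-2}\).)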
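The three density-estimation quantities above reduce to a few lines of arithmetic once the per-sample ELBO, the importance log-weights, and the per-dimension KL averages are available. The sketch below assumes those inputs are given; the function names and example numbers are hypothetical.

```python
import math

def bits_per_dim(neg_elbo_nats, num_dims):
    """Convert a negative ELBO in nats into bits per dimension."""
    return neg_elbo_nats / (num_dims * math.log(2))

def iwae_bound(log_weights):
    """log (1/K) sum_k w_k with log_weights[k] = log p(x, z_k) - log q(z_k | x).

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / len(log_weights))

def active_units(mean_kl_per_dim, threshold=1e-2):
    """Count latent dimensions whose average KL exceeds the threshold."""
    return sum(kl > threshold for kl in mean_kl_per_dim)

# A 32x32x3 image (D = 3072) with a negative ELBO of 3 bits/dim expressed in nats.
bpd = bits_per_dim(3072 * 3.0 * math.log(2), 3072)

# With K = 1 the IWAE bound is the single-sample ELBO estimate; with K > 1 it is
# at least the average of the log-weights (Jensen's inequality).
single = iwae_bound([-5.0])
multi = iwae_bound([-100.0, -102.0])

n_active = active_units([0.5, 0.001, 0.02, 0.0])
print(round(bpd, 6), single, multi >= -101.0, n_active)
```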
Sample / Reconstruction Quality¤
- Reconstruction error — pixel MSE on a held-out set; the simplest sanity check.
- rFID — reconstruction FID — Fréchet Inception Distance (Heusel et al., 2017) computed between real images and their reconstructions. The dominant 2024–2026 metric for evaluating tokenizer-VAEs (SD-VAE, LiteVAE, EQ-VAE).
- LPIPS (Zhang et al., 2018) — perceptual distance via VGG features; complements pixel MSE.
- gFID — generative FID — FID computed on unconditional samples \(x = \mathrm{Decoder}(z)\) with \(z \sim \mathcal{N}(0, I)\). Distinguishes a good codec from a good generator; a VAE can have low rFID but high gFID if the aggregated posterior is far from the prior.
- CMMD (Jayasumana et al., 2024) — CLIP-feature MMD; better correlated with human perception than FID for modern generators.
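Both rFID and gFID use the same Fréchet distance between two Gaussians fitted to feature sets; only the second set differs (reconstructions vs. prior samples). The sketch below shows that distance on placeholder features. Real evaluations use Inception-v3 pooling features, which are not computed here, and the function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))  # placeholder for features of real images
shifted_feats = real_feats + 1.0        # same covariance, mean shifted by 1 per dim

fd_same = frechet_distance(real_feats, real_feats)
fd_shift = frechet_distance(real_feats, shifted_feats)
print(f"identical sets: {fd_same:.6f}, shifted set: {fd_shift:.6f}")
```

For rFID pass features of real images and of their reconstructions; for gFID pass features of real images and of decoded prior samples.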
Representation / Latent-Space Quality¤
- Linear probing — train a linear classifier on \(\mu(x)\) for ImageNet / CIFAR-10 labels; tests whether the latent is semantically useful.
- MIG / DCI / SAP / FactorVAE score — disentanglement metrics for \(\beta\)-VAE / \(\beta\)-TCVAE / FactorVAE evaluations on dSprites / 3D-Shapes (see the Disentanglement section).
What to Report When¤
| Use case | Primary metric | Secondary |
|---|---|---|
| Latent-diffusion tokenizer (SD-VAE class) | rFID + LPIPS | PSNR, SSIM |
| Density estimation / scientific likelihood | BPD or IWAE NLL | Active units |
| Disentangled representation learning | MIG + DCI | Linear-probe accuracy |
| Standalone generative model | gFID + CMMD | rFID for codec quality |
Production Considerations¤
Inference Cost (the codec view)¤
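In the dominant 2024–2026 deployment pattern, the VAE brackets the diffusion prior in a latent-diffusion pipeline: the encoder runs once when there is an input image (image-to-image, inpainting, and at training time; plain text-to-image sampling starts from latent noise and skips it), and the decoder runs once at the end. With SD-style 8× spatial compression, a 1024² image becomes a 128² latent; the encoder/decoder typically account for 5–15 % of total inference time, the rest going to the diffusion prior.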
| Model | Parameters | 1024² VAE-decode latency (single H100, indicative) |
|---|---|---|
| SD 1.5 VAE (KL-f8) | ~85 M | ~30 ms |
| SDXL VAE (KL-f8) | ~85 M | ~25 ms (FP16) |
| LiteVAE (Sadat et al., 2024) | ~14 M | ~10 ms |
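The 8× compression arithmetic above can be made concrete. The sketch assumes the 4-channel latent used by the SD 1.5 / SDXL KL-f8 autoencoders; other tokenizers use different channel counts, and the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent (H, W, C) for an image under spatial downsampling by `downsample`."""
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """Ratio of image values to latent values."""
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * image_channels) / (h * w * c)

print(latent_shape(1024, 1024))       # (128, 128, 4)
print(compression_ratio(1024, 1024))  # 48.0
```

The diffusion prior therefore operates on 48× fewer values than a pixel-space model would at the same resolution, which is where most of the pipeline's efficiency comes from.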
Quantisation, Distillation, and Edge Deployment¤
- FP16 / BF16: halves memory, usually with no measurable quality loss. BF16 is the safer default; FP16 can overflow on large activations (the stock SDXL VAE is a known case that needs an FP16-fixed checkpoint or FP32 upcasting).
- INT8 PTQ: typically usable for the encoder; the decoder is more sensitive (artefacts on highlights and skin tones).
- TAESD (Boer Bohan, 2023): a tiny distilled SD-VAE decoder (~2 M params) used for live previews in web UIs; runs in <5 ms on consumer GPUs.
- Mobile / edge — TFLite / Core ML conversions of TAESD-style distilled decoders enable on-device latent-diffusion pipelines (Apple Image Playground, Stable Diffusion Mobile).
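One caveat on half precision is worth seeing numerically: FP16 has a much narrower dynamic range than FP32 or BF16, so a large intermediate activation can overflow to infinity and propagate NaNs through a decoder. The sketch below uses NumPy, which has no BF16 type, so only the FP16 side is shown; the activation value is made up.

```python
import numpy as np

# Largest finite FP16 value.
fp16_max = float(np.finfo(np.float16).max)  # 65504.0

# A hypothetical large intermediate activation that is fine in FP32.
activation = np.array([70000.0], dtype=np.float32)

with np.errstate(over="ignore"):  # silence the expected overflow warning
    as_fp16 = activation.astype(np.float16)

print(fp16_max, as_fp16, np.isinf(as_fp16).all())

# Subtracting two overflowed values yields NaN, which is how artefacts spread.
with np.errstate(invalid="ignore"):
    print(as_fp16 - as_fp16)
```

BF16 keeps FP32's 8-bit exponent, so the same value stays finite there at the cost of fewer mantissa bits.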
Ethical and Safety Notes¤
VAE codecs can memorize aspects of their training data, so reconstructions of close-to-distribution inputs may leak training-set details. Best practice: train VAEs on the same legally-cleared data as the diffusion prior they will pair with, and pair every deployed VAE-decoder with the same NSFW / safety classifiers as the surrounding diffusion stack.
For the broader unified picture and how VAE codecs fit alongside diffusion / flow / GAN / EBM / AR systems, see Generative Models — A Unified View.
Summary and Key Takeaways¤
VAEs are powerful generative models that combine deep learning with variational inference to learn structured, interpretable latent representations. Understanding VAEs provides essential foundations for modern generative modeling, from Stable Diffusion's continuous latent space to the discrete codes of VQ-VAE-style tokenizers that powered DALL·E 1, Parti, and MAGVIT-v2, and on to the video autoencoders behind systems such as Sora and Open-Sora.
Core Principles:
- ELBO objective balances reconstruction quality with latent space structure
- Reparameterization trick enables efficient gradient-based optimization
- Probabilistic framework creates smooth, continuous latent spaces suitable for generation
- Variational inference provides principled approximations to intractable posteriors
Key Variants:
- β-VAE / β-TCVAE / FactorVAE trade reconstruction for disentangled, interpretable representations
- VQ-VAE / VQGAN use discrete latents for improved quality and provide the discrete-token foundation of DALL·E 1 and similar systems
- Conditional VAE enables controlled generation with auxiliary information
- Hierarchical VAE (NVAE, VDVAE, HQ-VAE) captures multi-scale structure in complex data
- Latent-Diffusion VAEs (SD-VAE, LiteVAE, EQ-VAE, video VAEs) serve as perceptual codecs for diffusion priors; this is by far the most widely deployed use of VAEs today
Best Practices:
- Use KL annealing, especially for text
- Monitor both reconstruction and KL losses during training
- Consider perceptual or adversarial losses for sharper images
- Apply appropriate architecture choices for your data modality
- Start simple, add complexity as needed
Next Steps¤
-
Practical usage guide with implementation examples and training workflows
-
Complete API documentation for VAE, β-VAE, CVAE, and VQ-VAE classes
-
Step-by-step hands-on tutorial: train a VAE on MNIST from scratch
-
Explore hierarchical VAEs, VQ-VAE applications, and multi-modal learning
Additional Readings¤
Seminal Papers (Must Read)¤
Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes"
arXiv:1312.6114 (ICLR 2014; arXiv preprint 2013)
The original VAE paper introducing the framework and reparameterization trick
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models"
arXiv:1401.4082
Independent development of similar ideas with deep latent Gaussian models
Higgins, I., et al. (2017). "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework"
ICLR 2017
Introduces β-VAE for disentangled representations
Van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). "Neural Discrete Representation Learning"
arXiv:1711.00937
VQ-VAE for discrete latent representations
Tutorial Papers and Books¤
Kingma, D. P., & Welling, M. (2019). "An Introduction to Variational Autoencoders"
arXiv:1906.02691
Authoritative modern tutorial by the original authors
Doersch, C. (2016). "Tutorial on Variational Autoencoders"
arXiv:1606.05908
Excellent intuitive introduction with minimal prerequisites
Ghojogh, B., et al. (2021). "Factor Analysis, Probabilistic PCA, Variational Inference, and VAE: Tutorial and Survey"
arXiv:2101.00734
Connects VAEs to classical dimensionality reduction methods
Important VAE Variants¤
Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). "Importance Weighted Autoencoders"
arXiv:1509.00519 (ICLR 2016; arXiv preprint 2015)
Tighter likelihood bounds using importance sampling
Burgess, C. P., et al. (2018). "Understanding Disentangling in β-VAE"
arXiv:1804.03599
Theory and practice of disentanglement in β-VAE
Sønderby, C. K., et al. (2016). "Ladder Variational Autoencoders"
arXiv:1602.02282
Hierarchical VAEs with bidirectional inference
Vahdat, A., & Kautz, J. (2020). "NVAE: A Deep Hierarchical Variational Autoencoder"
arXiv:2007.03898
State-of-the-art deep hierarchical VAE for high-resolution images
Rezende, D., & Mohamed, S. (2015). "Variational Inference with Normalizing Flows"
arXiv:1505.05770
Flexible posterior distributions using invertible transformations
Kingma, D. P., et al. (2016). "Improved Variational Inference with Inverse Autoregressive Flow"
arXiv:1606.04934
Scalable flexible posteriors for complex distributions
Tomczak, J., & Welling, M. (2017). "VAE with a VampPrior"
arXiv:1705.07120
Learned mixture-of-posteriors prior for better modeling
Makhzani, A., et al. (2015). "Adversarial Autoencoders"
arXiv:1511.05644
Combining VAEs with adversarial training
VAEs in Modern Generative Pipelines (2022–2026)¤
Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)
arXiv:2112.10752
Established the KL-VAE-then-diffusion recipe that underpins SD/SDXL/Flux
Esser, P., Rombach, R., & Ommer, B. (2021). "Taming Transformers for High-Resolution Image Synthesis" (VQGAN)
arXiv:2012.09841
VQ-VAE + adversarial + perceptual loss — the autoencoder template behind SD-VAE
Pandey, K., et al. (2022). "DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents"
arXiv:2201.00308
Two-stage VAE→diffusion residual refinement
Vahdat, A., Kreis, K., & Kautz, J. (2021). "Score-based Generative Modeling in Latent Space" (LSGM)
arXiv:2106.05931
Diffusion as the prior of an NVAE, jointly optimised with one ELBO
Sadat, A., et al. (2024). "LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models"
arXiv:2405.14477
2D wavelet transform inside the encoder; roughly six-fold reduction in encoder parameters at matching reconstruction quality
Zhao, S., et al. (2024). "CV-VAE: A Compatible Video VAE for Latent Generative Video Models"
arXiv:2405.20279
Video VAE with image-VAE-compatible latent space
Wu, X., et al. (2024). "Improved Video VAE for Latent Video Diffusion Model" (IV-VAE)
arXiv:2411.06449
Keyframe temporal compression and group-causal 3D convolutions
Li, Y., et al. (2024). "WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow"
arXiv:2411.17459
2× throughput, 4× lower memory than prior video VAEs via wavelet routing
Kouzelis, T., et al. (2025). "EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling" (ICML 2025)
arXiv:2502.09509
Equivariance regulariser → 7× faster DiT-XL training; compatible with continuous and discrete autoencoders
Chen, H., et al. (2025). "Masked Autoencoders Are Effective Tokenizers for Diffusion Models" (MAETok)
arXiv:2502.03444
Drops the variational constraint; argues latent structure is what matters
Yao, J., et al. (2025). "Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models"
arXiv:2501.01423
Quantifies the rFID-vs-gFID trade-off in tokenizer VAE design
(2025). "Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models"
arXiv:2510.18457
Frozen DINOv2/SigLIP encoder + lightweight decoder rivals trained tokenizers
(2025–2026). "Latent Diffusion Model without Variational Autoencoder"
arXiv:2510.15301
Critiques the VAE bottleneck in LDMs and explores VAE-free alternatives
Wang, R., et al. (2025). "Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis"
arXiv:2506.00849
Joint generalisation theory for encoders + diffusion priors
Takida, Y., et al. (2024). "HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes" (TMLR)
openreview HQ-VAE
Unified variational treatment of stacked VQ-VAE codebooks
(2026). "Phase-Type Variational Autoencoders for Heavy-Tailed Data"
arXiv:2603.01800
Phase-Type decoders for tail-heavy distributions; outperforms Gaussian/Student-t/GEV
(2025). "Towards Scalable Pre-training of Visual Tokenizers for Generation"
arXiv:2512.13687
Decouples tokenizer pre-training from the diffusion prior; 1.11 gFID on ImageNet 256²
Posterior Collapse, Annealing, and Identifiability¤
Fu, H., et al. (2019). "Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing" (NAACL)
arXiv:1903.10145
The standard cyclical β schedule for text VAEs
Razavi, A., et al. (2019). "Preventing Posterior Collapse with delta-VAEs"
arXiv:1901.03416
Constrained-prior families with strict KL lower bound
Wang, Y., Blei, D., & Cunningham, J. (2021). "Posterior Collapse and Latent Variable Non-identifiability" (NeurIPS 2021)
arXiv:2301.00537
Identifies non-identifiability — not just KL pressure — as a root cause of collapse
Lucas, J., et al. (2019). "Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse"
arXiv:1911.02469
Shows collapse can arise from optimisation, not the bound itself
Chen, R. T. Q., et al. (2018). "Isolating Sources of Disentanglement in Variational Autoencoders" (β-TCVAE, NeurIPS)
arXiv:1802.04942
KL decomposition into MI / TC / dim-wise KL; introduces the MIG metric
Kim, H., & Mnih, A. (2018). "Disentangling by Factorising" (FactorVAE, ICML)
arXiv:1802.05983
Adversarially-estimated total-correlation penalty for disentanglement
Child, R. (2021). "Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images" (VDVAE, ICLR)
arXiv:2011.10650
70+ stochastic layers; beats PixelCNN on natural-image NLL
Locatello, F., et al. (2019). "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (ICML, best paper)
arXiv:1811.12359
Impossibility result: unsupervised disentanglement requires inductive biases or weak supervision
Application Papers¤
Bowman, S. R., et al. (2015). "Generating Sentences from a Continuous Space"
arXiv:1511.06349
VAEs for text generation (pioneering work)
Sohn, K., Lee, H., & Yan, X. (2015). "Learning Structured Output Representation using Deep Conditional Generative Models"
NeurIPS 2015
Conditional VAE framework
Gómez-Bombarelli, R., et al. (2018). "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules"
ACS Central Science
VAEs for molecular design and drug discovery
Online Resources and Code¤
Lil'Log: From Autoencoder to Beta-VAE
lilianweng.github.io/posts/2018-08-12-vae
Complete blog post with excellent visualizations
Jaan Altosaar's VAE Tutorial
jaan.io/what-is-variational-autoencoder-vae-tutorial
Clear mathematical derivations with intuitive explanations
Pythae: Unifying VAE Framework
github.com/clementchadebec/benchmark_VAE
Production-ready implementations with 15+ VAE variants
AntixK/PyTorch-VAE
github.com/AntixK/PyTorch-VAE
18+ VAE variants trained on CelebA for comparison
Awesome VAEs Collection
github.com/matthewvowels1/Awesome-VAEs
Curated list of ~900 papers on VAEs and disentanglement
Books and Surveys¤
Murphy, K. P. (2023). "Probabilistic Machine Learning: Advanced Topics"
Chapter on variational inference and deep generative models
Complete treatment connecting theory and practice
Foster, D. (2019). "Generative Deep Learning"
O'Reilly book with practical VAE implementations
Covers VAE, GAN, and autoregressive models
Singh, A., & Ogunfunmi, T. (2022). "An Overview of Variational Autoencoders for Source Separation, Finance, and Bio-Signal Applications"
PMC8774760
Survey of VAE applications across domains
Ready to implement VAEs? Start with the VAE User Guide for practical usage, check the API Reference for complete documentation, or dive into the MNIST Tutorial for hands-on experience!