Variational Autoencoder
Motivation & Context
Generative modelling with latent variables aims to model a data distribution \( p(x) \) through an auxiliary unobserved variable \( z \), with the joint \( p_\theta(x, z) = p_\theta(x \mid z)\, p(z) \) and a simple prior \( p(z) \) (typically an isotropic Gaussian \( \mathcal{N}(0, I) \)). Maximum-likelihood training requires the marginal over \( z \):
\[
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz .
\]
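To make the factorization \( p_\theta(x, z) = p_\theta(x \mid z)\, p(z) \) concrete, here is a minimal ancestral-sampling sketch in PyTorch. The 2-D latent, the 784-D binarized data space (e.g. flattened 28x28 images), and the small MLP `decoder` with Bernoulli outputs are all illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 2-D latent, 784-D data (e.g. flattened 28x28 binary images).
LATENT_DIM, DATA_DIM = 2, 784

# p_theta(x | z): a small MLP decoder producing Bernoulli logits per pixel.
decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM),
)

# Ancestral sampling from the joint p_theta(x, z) = p_theta(x | z) p(z):
z = torch.randn(16, LATENT_DIM)             # z ~ p(z) = N(0, I)
logits = decoder(z)                         # parameters of p_theta(x | z)
x = torch.bernoulli(torch.sigmoid(logits))  # x ~ p_theta(x | z)
```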
Two quantities are intractable for neural-network decoders. The marginal \( p_\theta(x) \) is an integral over a high-dimensional continuous \( z \); no closed form exists when \( p_\theta(x \mid z) \) is parameterized by a neural network. The posterior \( p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x) \) inherits the same intractability through its denominator. Classical EM works when the posterior is tractable (as in Gaussian mixtures or factor analysis); it does not apply here.
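One can see the intractability concretely by estimating the marginal with naive Monte Carlo, \( p_\theta(x) \approx \tfrac{1}{S} \sum_{s=1}^{S} p_\theta(x \mid z_s) \) with \( z_s \sim p(z) \). A sketch continuing the decoder above (the helper name `naive_log_marginal` and the sample count are hypothetical):

```python
import math
import torch
import torch.nn.functional as F

# Naive Monte Carlo estimate of log p_theta(x) for one binary x, reusing the
# hypothetical `decoder`, LATENT_DIM and DATA_DIM from the sketch above.
def naive_log_marginal(x, num_samples=10_000):
    z = torch.randn(num_samples, LATENT_DIM)   # z_s ~ p(z)
    logits = decoder(z)                        # (S, DATA_DIM) Bernoulli logits
    # log p_theta(x | z_s): negative binary cross-entropy summed over pixels
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        logits, x.expand(num_samples, -1), reduction="none"
    ).sum(dim=1)
    # log-mean-exp over samples for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(num_samples)

x = torch.bernoulli(torch.rand(DATA_DIM))      # a single binary data point
print(naive_log_marginal(x))
```

In high dimensions almost every prior sample assigns negligible likelihood to a given \( x \), so this estimator's variance explodes; this is precisely the gap a learned recognition model is meant to close.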
Why not a standard autoencoder?
A common misconception: a standard autoencoder already maps \( x \) to a bottleneck code and back, so why bother with a probabilistic formulation? A deterministic autoencoder yields an arbitrary, non-smooth code space with no probabilistic interpretation: sampling \( z \sim p(z) \) and decoding does not produce realistic data, because the decoder was never trained to cover a known prior. The VAE adds two couplings, a probabilistic recognition network and a KL penalty against the prior, precisely so that the decoder maps the whole prior to plausible data.
The paper introduces a practical stochastic gradient estimator for this setup: an amortized recognition network \( q_\phi(z \mid x) \), a variational lower bound on \( \log p_\theta(x) \) called the ELBO (tight exactly when \( q_\phi(z \mid x) \) matches the true posterior), and a reparameterization of the sampling step so that gradients flow through a differentiable path. Together these let any differentiable encoder/decoder pair be trained end-to-end with SGD on a single scalar objective.
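Concretely, the ELBO is

\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big),
\]

where, for a Gaussian \( q_\phi \), the KL term has a closed form and the reconstruction term is estimated with a reparameterized sample \( z = \mu + \sigma \odot \epsilon \), \( \epsilon \sim \mathcal{N}(0, I) \). The following PyTorch sketch puts the pieces together; the architecture, sizes, and single-sample estimate are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch; sizes and the one-sample ELBO estimate are illustrative."""
    def __init__(self, data_dim=784, latent_dim=2, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)      # mean of q_phi(z | x)
        self.logvar = nn.Linear(hidden, latent_dim)  # log-variance of q_phi(z | x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),             # Bernoulli logits for p_theta(x | z)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps moves the randomness into eps,
        # so gradients flow through mu and logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term (one-sample estimate) plus analytic KL(q_phi || N(0, I)).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# One SGD step on the single scalar objective:
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.bernoulli(torch.rand(32, 784))   # placeholder batch of binary data
logits, mu, logvar = model(x)
loss = negative_elbo(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
```

Minimizing the negative ELBO simultaneously trains the decoder to reconstruct data and penalizes \( q_\phi(z \mid x) \) for straying from the prior, which is exactly the coupling that makes samples \( z \sim p(z) \) decode to plausible data.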