Tag: path-generative-models
24 topics
- **Wasserstein Distance & Optimal Transport.** The \( p \)-Wasserstein distance \( W_p(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \big( \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|^p \big)^{1/p} \) measures the minimum cost of reshaping distribution \( \mu \) into \( \nu \). It underpins WGAN, flow matching, and a whole family of divergences that remain well-behaved when KL blows up. (Code sketch after the list.)
- **f-Divergences (Unified View).** For any convex \( f \) with \( f(1) = 0 \), the \( f \)-divergence \( D_f(P \| Q) = \mathbb{E}_Q[f(dP/dQ)] \) recovers KL (\( f = t \log t \)), reverse KL, Jensen–Shannon, total variation, \( \chi^2 \), Hellinger, and α-divergences as special cases. The variational (Fenchel) form underlies f-GAN and density-ratio estimation. (Code sketch after the list.)
- **Itô Calculus & Stochastic Differential Equations.** Itô calculus extends ordinary calculus to processes driven by Brownian motion. An SDE \( dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t \) combines a drift and a diffusion term; Itô's lemma replaces the chain rule. This is the mathematical substrate of score-based diffusion models, flow matching, and neural SDEs. (Code sketch after the list.)
- **Fokker–Planck & Probability-Flow ODE.** The Fokker–Planck equation \( \partial_t p_t = -\nabla \cdot (f p_t) + \tfrac{1}{2} \nabla^2 : (g g^\top p_t) \) governs how the density of an SDE-driven process evolves. The probability-flow ODE is a deterministic vector field whose trajectories have exactly these marginals \( p_t \), enabling DDIM-style deterministic sampling and likelihood computation.
- **Variational Autoencoder (VAE).** A latent-variable generative model trained by maximising the ELBO \( \mathcal{L}(x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\text{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \). The reparameterisation trick makes the encoder \( q_\phi \) differentiable; the decoder \( p_\theta \) learns to reconstruct \( x \) from latent codes \( z \sim \mathcal{N}(0, I) \). (Code sketch after the list.)
- **β-VAE & Disentanglement.** β-VAE replaces the ELBO's KL term with a weighted \( \beta \cdot D_{\text{KL}} \). Values \( \beta > 1 \) push the encoder toward an isotropic prior, encouraging each latent dimension to capture one independent factor of variation — the original disentanglement recipe.
- **Normalizing Flows (RealNVP, Glow).** Invertible neural networks \( f_\theta: \mathbb{R}^d \to \mathbb{R}^d \) with tractable Jacobian determinant. The change-of-variables formula \( \log p_X(x) = \log p_Z(f(x)) + \log |\det J_f(x)| \) gives exact likelihood; sampling runs \( f^{-1} \). RealNVP and Glow use coupling layers to make both directions \( O(d) \) per step. (Code sketch after the list.)
- **Autoregressive Flows (MAF & IAF).** Flows in which the \( i \)-th output depends only on previous inputs \( x_{<i} \), giving a triangular Jacobian. MAF (masked autoregressive flow) has fast density evaluation but slow sampling; IAF (inverse autoregressive flow) is the mirror image — fast sampling, slow density. Both are cornerstones of modern density estimation.
- **Energy-Based Models (EBM).** A generative model \( p_\theta(x) = \exp(-E_\theta(x))/Z(\theta) \) defined by a scalar energy \( E_\theta \). The intractable normaliser \( Z(\theta) = \int e^{-E_\theta(x)} dx \) precludes direct MLE; training uses contrastive divergence, score matching, or noise-contrastive estimation to approximate it.
- **Restricted Boltzmann Machines (RBM).** A bipartite EBM over visible and hidden binary units with energy \( E(v, h) = -v^\top W h - b^\top v - c^\top h \). Conditional independence within each layer gives closed-form conditionals \( p(h\mid v) \) and \( p(v\mid h) \); Hinton's Contrastive Divergence trains them, and stacked RBMs form a deep belief net. (Code sketch after the list.)
- **Noise-Contrastive Estimation (NCE).** Learn an unnormalised model \( \tilde p_\theta(x) \) by training a binary classifier to distinguish data samples from noise samples. The classifier's logit becomes \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \), so the partition function is absorbed into a learnable constant. Foundation of word2vec's negative sampling and of InfoNCE contrastive learning. (Code sketch after the list.)
- **Score-Based SDEs (Continuous-Time Diffusion).** Song et al. (2021) showed that discrete-time DDPMs and noise-conditional score models both discretise continuous-time SDEs of the form \( dx = f(x,t)\,dt + g(t)\,dW \). The unified framework gives a reverse-time SDE and a probability-flow ODE that share marginals, enabling flexible samplers (Euler, Heun, DPM-Solver) and exact likelihoods. (Code sketch after the list.)
- **GAN Family: WGAN, StyleGAN, BigGAN.** Three architectural and objective milestones: WGAN uses the Kantorovich–Rubinstein dual of \( W_1 \) as a smoother critic objective; StyleGAN introduces AdaIN-controlled style injection for image generation; BigGAN scales class-conditional GANs to 512×512 with orthogonal regularisation and the truncation trick.
- **U-Net Architecture.** A fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- **Neural Radiance Fields (NeRF) & 3D Gaussian Splatting.** NeRF encodes a 3-D scene as a continuous function \( (x, y, z, \theta, \phi) \to (\text{colour}, \text{density}) \) queried along camera rays and volume-rendered into pixels. 3D Gaussian Splatting replaces the implicit MLP with an explicit set of anisotropic Gaussians rasterised in real time. (Code sketch after the list.)
- **Neural Ordinary Differential Equations.** A neural ODE defines the hidden-state evolution as \( dh/dt = f_\theta(h, t) \), integrated by a black-box ODE solver. Training uses the adjoint method to back-propagate at constant memory regardless of solver depth. Connects residual networks to continuous flows and underlies continuous normalising flows and flow matching.
- **InfoNCE & NT-Xent Contrastive Losses.** InfoNCE maximises a mutual-information lower bound by classifying a positive pair against \( k \) negatives: \( \mathcal{L} = -\log \frac{\exp(s^+)}{\sum_i \exp(s_i)} \). NT-Xent is InfoNCE with temperature-scaled cosine similarities. Drives SimCLR, MoCo, CLIP, and most modern self-supervised representation learning. (Code sketch after the list.)
- **PixelCNN / PixelCNN++.** Autoregressive image models that factor \( p(x) = \prod_i p(x_i \mid x_{1:i-1}) \) with masked convolutions, so each pixel sees only pixels above and to its left. Tractable likelihood and sharp samples; later variants add gated activations and horizontal/vertical stacks, and PixelCNN++ adds a discretised logistic-mixture likelihood. (Code sketch after the list.)
- **Text-to-Image: DALL-E Lineage & Imagen.** Autoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- **Unified Multimodal Models (GPT-4o / Gemini any-to-any).** Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- **Video Diffusion (Sora, Veo, Gen-3).** Extends image-diffusion recipes to video with 3D patch embeddings, temporal attention, and long-context handling. Sora (OpenAI), Veo (Google), and Gen-3 (Runway) train DiT-style transformers over space-time patches of 1–60 second clips, conditioning on rich text captions for controllable generation.
- **Autoregressive vs Diffusion Tradeoffs.** Autoregressive models factorise \( p(x) = \prod_t p(x_t \mid x_{<t}) \) and dominate text generation; diffusion models learn a denoising process and dominate continuous-modality generation. The two paradigms differ in likelihood tractability, sampling cost, controllability, and compositionality — and the right choice depends on whether the data are discrete tokens, whether parallel decoding is required, and whether log-likelihood or perceptual quality is the figure of merit.
- **Text-to-Image Alignment.** Text-to-image (T2I) alignment is the task of making generated images faithfully follow textual prompts — covering spatial layout, attribute binding, count, and style. Modern alignment relies on text encoders, either contrastive image–text models (CLIP, SigLIP) or pure language encoders (T5), injected via cross-attention into a diffusion or flow backbone, plus classifier-free guidance, RLHF-style preference fine-tuning, and reward models that grade prompt adherence. (Code sketch after the list.)
- **Generative Model Evaluation (FID, IS, and their limits).** Fréchet Inception Distance (FID) and Inception Score (IS) are the standard automated metrics for image generative models; both rely on Inception-v3 features and have well-known biases. Modern T2I evaluation supplements them with CLIPScore, prompt-adherence benchmarks (T2I-CompBench, GenEval), human-preference reward models (ImageReward, HPS), and likelihood/NLL where applicable. (Code sketch after the list.)
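
Code sketches referenced above. Each is a minimal, self-contained illustration of the corresponding entry, built on toy data and placeholder names rather than any reference implementation.

For the Wasserstein entry: in one dimension the optimal coupling simply matches sorted samples (quantiles), so the empirical \( W_p \) between two equal-sized samples reduces to a sorted-difference norm. The Gaussian test case is an arbitrary choice.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-sized 1-D samples.

    In 1-D the optimal transport plan matches the i-th smallest x to the
    i-th smallest y, so W_p reduces to a mean of |x_(i) - y_(i)|^p.
    """
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, size=10_000)   # samples from mu = N(0, 1)
nu = rng.normal(2.0, 1.0, size=10_000)   # samples from nu = N(2, 1)
print(wasserstein_1d(mu, nu, p=2))       # ~2.0: W_2 between N(0,1) and N(2,1) is 2
```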
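For the f-divergence entry: a sketch evaluating \( D_f(P\|Q) = \sum_i q_i\, f(p_i/q_i) \) on discrete distributions, with a few standard generators; the dictionary keys are my own names for them.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = E_Q[f(dP/dQ)] for discrete distributions p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

generators = {
    "kl":         lambda t: t * np.log(t),        # KL(P || Q)
    "reverse_kl": lambda t: -np.log(t),           # KL(Q || P)
    "chi2":       lambda t: (t - 1.0) ** 2,       # Pearson chi-squared
    "tv":         lambda t: 0.5 * np.abs(t - 1),  # total variation
}

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
for name, f in generators.items():
    print(name, f_divergence(p, q, f))
```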
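For the Itô/SDE entry: an Euler–Maruyama integrator, the simplest discretisation of \( dX_t = \mu\,dt + \sigma\,dW_t \), run here on an Ornstein–Uhlenbeck process as a toy choice.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t_max, n_steps, rng):
    """Simulate dX_t = drift(x, t) dt + diffusion(x, t) dW_t starting from x0."""
    dt = t_max / n_steps
    x, path = x0, [x0]
    for i in range(n_steps):
        t = i * dt
        dw = rng.normal(0.0, np.sqrt(dt))            # Brownian increment ~ N(0, dt)
        x = x + drift(x, t) * dt + diffusion(x, t) * dw
        path.append(x)
    return np.array(path)

rng = np.random.default_rng(0)
# Ornstein-Uhlenbeck: mean-reverting drift, constant diffusion coefficient.
path = euler_maruyama(
    x0=2.0,
    drift=lambda x, t: -1.5 * x,
    diffusion=lambda x, t: 0.5,
    t_max=5.0, n_steps=1000, rng=rng,
)
print(path[-1])
```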
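For the VAE entry: a PyTorch sketch of the negative-ELBO objective on a fake batch, showing the reparameterisation trick and the closed-form Gaussian KL term; layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterisation trick
        x_hat = self.dec(z)
        # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I)).
        rec = nn.functional.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return (rec + kl) / x.shape[0]

vae = TinyVAE()
x = torch.rand(32, 784)        # stand-in batch with values in [0, 1]
loss = vae(x)
loss.backward()
print(float(loss))
```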
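For the normalizing-flow entry: a single RealNVP-style affine coupling layer with its exact inverse and log-determinant; the scale/shift network is a minimal placeholder, and the `tanh` on the scales is just a stabilisation choice.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1;  y2 = x2 * exp(s(x1)) + t(x1).  The Jacobian is triangular."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                     # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)               # log |det J| of the transform
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(dim=4)
x = torch.randn(8, 4)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))   # True: exact invertibility
```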
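For the RBM entry: one Contrastive-Divergence (CD-1) update in numpy, moving the weights toward the data statistics and away from one-step reconstruction statistics; learning rate and layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM with energy E = -v'Wh - b'v - c'h."""
    # Positive phase: p(h | v0) factorises thanks to the bipartite structure.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to visibles, then to hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate log-likelihood gradient: data statistics minus model statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v_batch = (rng.random((16, n_visible)) < 0.5).astype(float)
cd1_step(v_batch, W, b, c)
```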
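For the NCE entry: a PyTorch sketch where the classifier logit is \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \) and a learnable scalar absorbs the log-partition function; the unnormalised Gaussian model and all names are illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class UnnormalisedGaussian(nn.Module):
    """log p~_theta(x) = -(x - mu)^2 / (2 sigma^2) + c, with c a learnable log-normaliser."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.tensor(0.0))
        self.log_sigma = nn.Parameter(torch.tensor(0.0))
        self.c = nn.Parameter(torch.tensor(0.0))   # absorbs log Z(theta)

    def log_prob_tilde(self, x):
        return -(x - self.mu) ** 2 / (2 * torch.exp(2 * self.log_sigma)) + self.c

model = UnnormalisedGaussian()
noise = D.Normal(0.0, 3.0)                         # known, easy-to-sample noise distribution

x_data = torch.randn(512) * 1.5 + 2.0              # stand-in "data": N(2, 1.5^2)
x_noise = noise.sample((512,))

# Classifier logit: log p~_theta(x) - log q_noise(x); labels 1 = data, 0 = noise.
logits = torch.cat([
    model.log_prob_tilde(x_data) - noise.log_prob(x_data),
    model.log_prob_tilde(x_noise) - noise.log_prob(x_noise),
])
labels = torch.cat([torch.ones(512), torch.zeros(512)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
print(float(loss))
```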
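For the score-based-SDE entry: reverse-time Euler–Maruyama sampling for a variance-exploding toy where the data distribution is Gaussian, so the perturbed score is known in closed form and no network is needed to see the mechanics; all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution N(m, s^2); VE forward SDE dx = sqrt(d sigma^2/dt) dW with sigma(t) = t.
m, s, T = 2.0, 0.5, 5.0

def score(x, t):
    # The time-t marginal is N(m, s^2 + t^2), so the score has a closed form.
    return -(x - m) / (s**2 + t**2)

n_steps, n_samples = 1000, 5000
dt = T / n_steps
x = rng.normal(m, np.sqrt(s**2 + T**2), size=n_samples)   # start from the time-T marginal

for i in range(n_steps):                                   # integrate the reverse-time SDE
    t = T - i * dt
    g2 = 2.0 * t                                           # g(t)^2 = d sigma^2(t) / dt
    x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.normal(size=n_samples)

print(x.mean(), x.std())   # should land close to (2.0, 0.5), the data mean and std
```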
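For the NeRF entry: the volume-rendering quadrature along a single ray, alpha-compositing sampled densities and colours with accumulated transmittance; the density and colour arrays are stand-ins for MLP outputs.

```python
import numpy as np

def render_ray(ts, sigmas, colours):
    """Quadrature of the volume-rendering integral along one ray.

    ts:      (N,) sample depths along the ray
    sigmas:  (N,) volume densities at those samples
    colours: (N, 3) RGB at those samples
    """
    deltas = np.diff(ts, append=ts[-1] + 1e10)     # distances between adjacent samples
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity contributed by each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance T_i
    weights = trans * alphas
    return weights @ colours                       # expected colour of the ray

ts = np.linspace(2.0, 6.0, 64)
sigmas = np.exp(-(ts - 4.0) ** 2 / 0.1)            # a blob of density near depth 4
colours = np.tile([0.8, 0.2, 0.1], (64, 1))        # constant reddish colour along the ray
print(render_ray(ts, sigmas, colours))
```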
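For the InfoNCE/NT-Xent entry: the loss written as a cross-entropy over a temperature-scaled cosine-similarity matrix, with row \( i \)'s positive at column \( i \). This is a simplified one-directional version (full NT-Xent symmetrises over both views); the embeddings are random stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE: each row's positive is the same index in the other view."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature        # (N, N) cosine similarities / tau
    targets = torch.arange(z_a.shape[0])      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

z_a = torch.randn(128, 64)                    # embeddings of view A
z_b = torch.randn(128, 64)                    # embeddings of view B (positives of A)
print(float(info_nce(z_a, z_b)))
```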
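For the PixelCNN entry: construction of the raster-scan causal mask applied to convolution kernels; a type-"A" mask hides the centre pixel (first layer), a type-"B" mask allows it (deeper layers).

```python
import numpy as np

def pixelcnn_mask(kernel_size, mask_type="A"):
    """Causal mask: keep pixels above, and pixels to the left in the same row."""
    k = kernel_size
    mask = np.ones((k, k), dtype=np.float32)
    centre = k // 2
    mask[centre, centre + (1 if mask_type == "B" else 0):] = 0.0  # right of (or at) centre
    mask[centre + 1:, :] = 0.0                                    # all rows below centre
    return mask

print(pixelcnn_mask(5, "A"))   # centre masked: the output pixel cannot see itself
print(pixelcnn_mask(5, "B"))   # centre visible: deeper layers may use the current feature
```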
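For the text-to-image-alignment entry: classifier-free guidance at one denoising step, evaluating the model with and without the text conditioning and extrapolating between the two predictions. The `denoiser` callable and its signature are stand-ins, not a real API.

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction from unconditional toward conditional.

    denoiser(x_t, t, cond) -> predicted noise; an assumed signature for illustration.
    """
    eps_uncond = denoiser(x_t, t, null_emb)   # prediction with the "empty" prompt
    eps_cond = denoiser(x_t, t, text_emb)     # prediction with the actual prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy check with a fake denoiser that just mixes its inputs.
fake_denoiser = lambda x, t, c: x * 0.1 + c.mean() * 0.01
x_t = torch.randn(1, 4, 64, 64)
text_emb, null_emb = torch.randn(77, 768), torch.zeros(77, 768)
eps = cfg_noise_prediction(fake_denoiser, x_t, 0.5, text_emb, null_emb)
print(eps.shape)
```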
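For the evaluation entry: the Fréchet distance between two Gaussians fitted to feature matrices, \( \|\mu_1-\mu_2\|^2 + \operatorname{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}) \). Real FID extracts the features with a fixed Inception-v3 network, which is replaced here by random stand-in features.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature matrices."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, (2048, 64))   # stand-ins for Inception-v3 activations
fake_feats = rng.normal(0.3, 1.1, (2048, 64))
print(frechet_distance(real_feats, fake_feats))
```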