Tag: generative
39 topics
- Reparameterization Trick (VAE): The reparameterization trick writes a stochastic latent sample as a differentiable transformation of parameters and noise, typically \( z = \mu + \sigma \odot \epsilon \) with \( \epsilon \sim \mathcal{N}(0, I) \). This lets gradients flow through sampling and makes variational autoencoder training practical with backpropagation.
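  A minimal PyTorch sketch of the trick (function and variable names are illustrative):

  ```python
  import torch

  def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
      """Sample z ~ N(mu, sigma^2) as a differentiable function of (mu, log_var)."""
      sigma = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
      eps = torch.randn_like(sigma)      # all randomness lives in eps
      return mu + sigma * eps            # gradients flow through mu and sigma
  ```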
- GAN Minimax Objective: The GAN minimax objective sets up a two-player game in which a generator tries to produce samples that fool a discriminator, while the discriminator tries to distinguish real from generated data. At equilibrium the generator matches the data distribution, though the training game is often unstable in practice.
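  For reference, the value function from Goodfellow et al. (2014):

  \[
  \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
  \]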
- Unsupervised Learning: Unsupervised learning tries to discover structure in data without labeled targets, such as clusters, latent factors, or a density model. It is used for representation learning, dimensionality reduction, clustering, and generative modeling when explicit supervision is unavailable.
- Latent Space: A latent space is the internal feature space in which a model represents inputs after transformation, often in a form that is more compact or task-relevant than raw data. Distances or directions in latent space can encode meaningful variation, but only relative to the model and objective that learned it.
- Denoising Diffusion Probabilistic Models (DDPM): A generative model that learns to reverse a fixed Gaussian corruption process. Ho et al. (2020) showed that predicting the added noise with a neural network, trained with a simple MSE loss over \( T \) diffusion steps, yields state-of-the-art image synthesis — the foundation of most modern image/video diffusion models.
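  A sketch of one training step under common assumptions (a noise-prediction network `model(x_t, t)` and a precomputed \( \bar{\alpha}_t \) schedule; names are illustrative):

  ```python
  import torch
  import torch.nn.functional as F

  def ddpm_loss(model, x0, alphas_cumprod):
      """Sample a random timestep, corrupt x0, and regress the added noise."""
      B = x0.shape[0]
      t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
      a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
      eps = torch.randn_like(x0)
      x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward corruption
      return F.mse_loss(model(x_t, t), eps)               # the "simple" loss
  ```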
- Classifier-Free Guidance: Classifier-free guidance is a sampling trick for conditional diffusion models that combines conditional and unconditional predictions to push samples harder toward the prompt. It improves prompt adherence without a separate classifier, but too much guidance can oversaturate images and reduce diversity.
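  The guided prediction at each denoising step is a one-liner (a sketch; \( w \) is the guidance scale, with \( w = 1 \) recovering the conditional model):

  ```python
  def cfg_eps(eps_cond, eps_uncond, w: float):
      """Extrapolate past the conditional noise prediction; w > 1 strengthens
      prompt adherence at the cost of diversity."""
      return eps_uncond + w * (eps_cond - eps_uncond)
  ```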
- Flow Matching / Rectified Flow: A generative modelling framework that learns a time-dependent velocity field mapping noise to data along a fixed probability path. Rectified Flow in particular learns straight-line paths between noise and data samples, enabling 1–4 step sampling with quality approaching that of many-step diffusion samplers.
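  A sketch of the conditional flow-matching loss on straight paths, assuming a velocity network `v_model(x_t, t)` (names are illustrative):

  ```python
  import torch

  def rectified_flow_loss(v_model, x1):
      """Regress the constant velocity x1 - x0 along x_t = (1 - t) x0 + t x1."""
      x0 = torch.randn_like(x1)                   # noise endpoint
      t = torch.rand(x1.shape[0], device=x1.device)
      t_b = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcastable t
      x_t = (1 - t_b) * x0 + t_b * x1             # straight-line path
      return ((v_model(x_t, t) - (x1 - x0)) ** 2).mean()
  ```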
- Latent Diffusion: Run the diffusion process in the compressed latent space of a pretrained VAE rather than in pixel space. Latent diffusion (Rombach et al., 2022) slashes memory and compute by operating on latents roughly 8× smaller in each spatial dimension while preserving sample quality, and is the architecture behind Stable Diffusion, SDXL, SD3, and most text-to-image systems.
- Score Matching: An estimation principle (Hyvärinen, 2005) that fits an unnormalised density by matching the model's score \( \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) to the data's score. Integration by parts eliminates the unknown data score, yielding a tractable objective that underlies modern score-based diffusion models.
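  After integration by parts, the objective takes the tractable form (writing \( \mathbf{s}_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \)):

  \[
  J(\theta) = \mathbb{E}_{p_{\text{data}}}\!\left[ \operatorname{tr}\!\big(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\big) + \tfrac{1}{2}\, \|\mathbf{s}_\theta(\mathbf{x})\|^2 \right] + \text{const.}
  \]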
- Denoising Score Matching: Vincent (2011) showed that the score of a Gaussian-corrupted data distribution \( q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \) admits a closed-form target, reducing score learning to a simple regression: predict \( (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). This identity is the algorithmic heart of modern diffusion models.
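  The resulting regression is a few lines at a single noise level (a sketch; `score_model` is an assumed network mapping a noisy input to a score estimate):

  ```python
  import torch

  def dsm_loss(score_model, x, sigma: float):
      """Regress the closed-form target (x - x_tilde) / sigma**2 = -eps / sigma."""
      eps = torch.randn_like(x)
      x_tilde = x + sigma * eps          # Gaussian corruption
      target = (x - x_tilde) / sigma**2
      return ((score_model(x_tilde) - target) ** 2).mean()
  ```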
- DDIM (Deterministic Diffusion Sampling): Song, Meng & Ermon (2020) introduced a non-Markovian sampler for DDPM-trained diffusion models that generates samples in a fraction of the steps while matching quality. DDIM's \( \eta = 0 \) limit is a deterministic ODE integrator, enabling latent interpolation and invertibility.
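  One deterministic (\( \eta = 0 \)) update, assuming cumulative-product alphas are passed as tensors (names are illustrative):

  ```python
  import torch

  def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
      """Predict x0 from the noise estimate, then jump to the previous step."""
      x0_pred = (x_t - (1 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()
      return a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps_pred
  ```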
- Consistency Models: Song et al. (2023) train a neural network \( f_\theta(\mathbf{x}_t, t) \) whose output is consistent along the probability-flow ODE trajectories of a diffusion model, so that \( f_\theta(\mathbf{x}_t, t) \approx \mathbf{x}_0 \) for every \( t \). This collapses diffusion sampling to a single step, with optional multi-step refinement for quality.
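  The consistency-distillation loss pairs adjacent points on the same ODE trajectory, with \( \hat{\mathbf{x}}_{t_n} \) obtained by one solver step from \( \mathbf{x}_{t_{n+1}} \) and \( \theta^- \) an EMA copy of \( \theta \):

  \[
  \mathcal{L} = \mathbb{E}\big[ d\big( f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat{\mathbf{x}}_{t_n}, t_n) \big) \big]
  \]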
- Diffusion Transformers (DiT): Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer over VAE-latent patches. DiT scales predictably with compute, matches or exceeds U-Net quality, and is the architectural backbone of Stable Diffusion 3, Sora, and most frontier text-to-image/video diffusion models.
- Stable Diffusion Pipeline: A text-to-image pipeline composed of (i) a VAE that compresses pixels to a 64×-smaller latent, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. All three pretrained components are combined at inference, with classifier-free guidance steering the conditioning.
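  With the diffusers library the whole pipeline is a few lines; the checkpoint name below is one public example, and `guidance_scale` is the classifier-free-guidance weight:

  ```python
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
  image = pipe("a watercolor fox in a forest", guidance_scale=7.5).images[0]
  image.save("fox.png")
  ```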
- Encodec / Neural Audio Codecs: Meta's Encodec (2022) is a neural audio codec that compresses audio to discrete tokens via residual vector quantisation (RVQ) and reconstructs it with a neural decoder. Codecs of this kind are the tokenisers of choice for generative audio models (AudioLM, MusicGen, VALL-E), bridging continuous audio and LLM-style discrete modelling.
- Wasserstein Distance & Optimal Transport: The \( p \)-Wasserstein distance \( W_p(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \big( \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|^p \big)^{1/p} \) measures the minimum cost of reshaping distribution \( \mu \) into \( \nu \). It underpins WGAN, flow matching, and a whole family of divergences that remain well-behaved when KL blows up.
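  In one dimension \( W_1 \) is easy to estimate from samples; SciPy ships an implementation:

  ```python
  import numpy as np
  from scipy.stats import wasserstein_distance

  rng = np.random.default_rng(0)
  a = rng.normal(0.0, 1.0, 10_000)   # samples from N(0, 1)
  b = rng.normal(2.0, 1.0, 10_000)   # samples from N(2, 1)
  print(wasserstein_distance(a, b))  # close to 2: W1 of a pure mean shift
  ```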
- f-Divergences (Unified View): For any convex \( f \) with \( f(1) = 0 \), the \( f \)-divergence \( D_f(P \| Q) = \mathbb{E}_Q[f(dP/dQ)] \) recovers KL (\( f = t \log t \)), reverse KL, Jensen–Shannon, total variation, \( \chi^2 \), Hellinger, and α-divergences as special cases. The variational (Fenchel) form underlies f-GAN and density-ratio estimation.
- Itô Calculus & Stochastic Differential Equations: Itô calculus extends ordinary calculus to processes driven by Brownian motion. An SDE \( dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t \) combines a drift and a diffusion term; Itô's lemma replaces the chain rule. This is the mathematical substrate of score-based diffusion models, flow matching, and neural SDEs.
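  A toy Euler–Maruyama integrator makes the drift/diffusion split concrete (a sketch, not a production solver):

  ```python
  import numpy as np

  def euler_maruyama(mu, sigma, x0, T=1.0, n=1000, seed=0):
      """Simulate dX_t = mu(X_t, t) dt + sigma(X_t, t) dW_t on [0, T]."""
      rng = np.random.default_rng(seed)
      dt = T / n
      x = np.empty(n + 1)
      x[0] = x0
      for i in range(n):
          dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment
          x[i + 1] = x[i] + mu(x[i], i * dt) * dt + sigma(x[i], i * dt) * dW
      return x

  # e.g. an Ornstein-Uhlenbeck process: dX = -X dt + 0.5 dW
  path = euler_maruyama(lambda x, t: -x, lambda x, t: 0.5, x0=1.0)
  ```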
- Fokker–Planck & Probability-Flow ODE: The Fokker–Planck equation \( \partial_t p_t = -\nabla \cdot (f p_t) + \tfrac{1}{2} \nabla^2 : (g g^\top p_t) \) governs how the density of an SDE-driven process evolves. The probability-flow ODE is a deterministic vector field whose trajectories share these exact marginals, enabling DDIM-style deterministic sampling and likelihood computation.
- Variational Autoencoder (VAE): A latent-variable generative model trained by maximising the ELBO \( \mathcal{L}(x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\text{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \). The reparameterisation trick makes the encoder \( q_\phi \) differentiable; the decoder \( p_\theta \) reconstructs \( x \) from encoded latents during training and generates new samples from the prior \( z \sim \mathcal{N}(0, I) \).
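  A sketch of the negative ELBO for the common Gaussian-encoder / Bernoulli-decoder choice (the reconstruction term depends on the decoder's likelihood):

  ```python
  import torch
  import torch.nn.functional as F

  def negative_elbo(x, x_recon, mu, log_var):
      """Reconstruction + closed-form KL(N(mu, sigma^2) || N(0, I))."""
      recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
      kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
      return recon + kl
  ```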
- β-VAE & Disentanglement: β-VAE replaces the ELBO's KL term with a weighted \( \beta \cdot D_{\text{KL}} \). Values \( \beta > 1 \) push the encoder toward an isotropic prior, encouraging each latent dimension to capture one independent factor of variation — the original disentanglement recipe.
- Normalizing Flows (RealNVP, Glow): Invertible neural networks \( f_\theta: \mathbb{R}^d \to \mathbb{R}^d \) with tractable Jacobian determinant. The change-of-variables formula \( \log p_X(x) = \log p_Z(f(x)) + \log |\det J_f(x)| \) gives exact likelihood; sampling runs \( f^{-1} \). RealNVP and Glow use coupling layers to make both directions \( O(d) \) per step.
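  The workhorse is the affine coupling layer; a sketch with assumed `scale_net` / `shift_net` MLPs:

  ```python
  import torch

  def affine_coupling(x, scale_net, shift_net):
      """Transform half the dims conditioned on the other half; triangular Jacobian."""
      x1, x2 = x.chunk(2, dim=-1)
      s = scale_net(x1)                       # log-scales, depend only on x1
      y2 = x2 * torch.exp(s) + shift_net(x1)
      log_det = s.sum(dim=-1)                 # log |det J| = sum of log-scales
      return torch.cat([x1, y2], dim=-1), log_det
  ```

  The inverse is just as cheap: recover `x2 = (y2 - shift_net(x1)) * torch.exp(-s)`, which is why both density evaluation and sampling run in \( O(d) \).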
- Autoregressive Flows (MAF & IAF): Flows in which the \( i \)-th output depends only on previous inputs \( x_{<i} \), giving a triangular Jacobian. MAF (masked autoregressive flow) has fast density evaluation but slow sampling; IAF (inverse autoregressive flow) is the mirror image — fast sampling, slow density. Both are cornerstones of modern density estimation.
- Energy-Based Models (EBM): A generative model \( p_\theta(x) = \exp(-E_\theta(x))/Z(\theta) \) defined by a scalar energy \( E_\theta \). The intractable normaliser \( Z(\theta) = \int e^{-E_\theta(x)} dx \) precludes direct MLE; training uses contrastive divergence, score matching, or noise-contrastive estimation to approximate it.
- Restricted Boltzmann Machines (RBM): A bipartite EBM over visible and hidden binary units with energy \( E(v, h) = -v^\top W h - b^\top v - c^\top h \). Conditional independence within each layer gives closed-form conditionals \( p(h\mid v), p(v\mid h) \); Hinton's contrastive divergence trains them, and stacked RBMs form a deep belief network.
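  A sketch of one CD-1 update in NumPy (shapes assumed: `v0` is `(N, d_v)`, `W` is `(d_v, d_h)`; this variant uses probabilities rather than samples for the final statistics):

  ```python
  import numpy as np

  def cd1_update(W, b, c, v0, lr=0.01, rng=None):
      """One contrastive-divergence step using the closed-form conditionals."""
      rng = np.random.default_rng(0) if rng is None else rng
      sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
      ph0 = sigmoid(v0 @ W + c)                         # p(h=1 | v0)
      h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
      pv1 = sigmoid(h0 @ W.T + b)                       # one-step reconstruction
      ph1 = sigmoid(pv1 @ W + c)
      W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)    # positive - negative phase
      b += lr * (v0 - pv1).mean(axis=0)
      c += lr * (ph0 - ph1).mean(axis=0)
  ```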
- Score-Based SDEs (Continuous-Time Diffusion): Song et al. (2021) showed that discrete-time DDPM and noise-conditional score models are both limits of a continuous-time SDE \( dx = f(x,t)dt + g(t)dW \). The unified framework gives a reverse-time SDE and a probability-flow ODE that share marginals, enabling flexible samplers (Euler, Heun, DPM-Solver) and exact likelihoods.
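  For reference, the reverse-time SDE (Anderson, 1982) and the probability-flow ODE that shares its marginals:

  \[
  d\mathbf{x} = \big[f(\mathbf{x},t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,dt + g(t)\,d\bar{W}_t,
  \qquad
  d\mathbf{x} = \big[f(\mathbf{x},t) - \tfrac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,dt.
  \]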
- GAN Family (WGAN, StyleGAN, BigGAN): Three architectural and objective milestones: WGAN uses the Kantorovich–Rubinstein dual of \( W_1 \) as a smoother critic, StyleGAN introduces AdaIN-controlled style injection for image generation, BigGAN scales class-conditional GANs to 512×512 with orthogonal regularisation and truncation tricks.
- U-Net Architecture: A fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- Neural Radiance Fields (NeRF) & 3D Gaussian Splatting: NeRF encodes a 3-D scene as a continuous function \( (x, y, z, \theta, \phi) \to (\text{colour}, \text{density}) \) queried along camera rays and volume-rendered into pixels. 3D Gaussian Splatting replaces the implicit MLP with an explicit set of anisotropic Gaussians rasterised in real time.
- Neural Ordinary Differential Equations: A neural ODE defines the hidden-state evolution as \( dh/dt = f_\theta(h, t) \), integrated by a black-box ODE solver. Training uses the adjoint method to back-propagate at constant memory regardless of solver depth. Connects residual networks to continuous flows and underlies continuous normalising flows and flow matching.
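  A fixed-step Euler integrator shows the ResNet connection (a toy stand-in for the adaptive black-box solvers, e.g. `torchdiffeq.odeint`, used in practice):

  ```python
  def odeint_euler(f, h0, t0=0.0, t1=1.0, steps=100):
      """Integrate dh/dt = f(h, t); each step is a residual update h += dt * f."""
      h, dt = h0, (t1 - t0) / steps
      for i in range(steps):
          h = h + dt * f(h, t0 + i * dt)
      return h
  ```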
- PixelCNN / PixelCNN++: Autoregressive image models that factor \( p(x) = \prod_i p(x_i \mid x_{1:i-1}) \) with masked convolutions so each pixel sees only pixels above and to the left. Tractable likelihood and sharp samples; Gated PixelCNN adds gated activations and horizontal/vertical stacks, and PixelCNN++ adds a discretised mixture-of-logistics likelihood and architectural streamlining.
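  The raster-scan constraint comes from a convolution mask like the following sketch (type 'A' for the first layer hides the centre pixel; type 'B' elsewhere allows it):

  ```python
  import numpy as np

  def pixelcnn_mask(k: int, mask_type: str = "A") -> np.ndarray:
      """k x k mask: ones at positions above, or left-of-centre in the same row."""
      m = np.zeros((k, k), dtype=np.float32)
      m[: k // 2, :] = 1.0            # all rows strictly above the centre
      m[k // 2, : k // 2] = 1.0       # left of the centre in the centre row
      if mask_type == "B":
          m[k // 2, k // 2] = 1.0     # type B may also see the centre (itself)
      return m
  ```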
- Text-to-Image (DALL-E Lineage & Imagen): Autoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any): Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- Video Diffusion (Sora, Veo, Gen-3): Extend image-diffusion recipes to video with 3D patch embeddings, temporal attention, and long-context handling. Sora (OpenAI), Veo (Google), and Gen-3 (Runway) train DiT-style transformers over space-time patches of 1–60 second clips, conditioning on rich text captions for controllable generation.
- Disentangled Representation Learning: Disentangled representation learning seeks latent coordinates that each correspond to separate underlying factors of variation in the data. It is attractive for control and interpretability, but in the unsupervised setting true disentanglement is usually not identifiable without extra inductive bias or supervision.
- Autoregressive vs Diffusion Tradeoffs: Autoregressive models factorise \( p(x) = \prod_t p(x_t \mid x_{<t}) \) and dominate text generation; diffusion models learn a denoising process and dominate continuous-modality generation. The two paradigms differ in likelihood tractability, sampling cost, controllability, and compositionality; the right choice depends on whether the data are discrete tokens, whether parallel decoding is required, and whether log-likelihood or perceptual quality is the figure of merit.
- Text-to-Image Alignment: Text-to-image (T2I) alignment is the task of making generated images faithfully follow textual prompts — covering spatial layout, attribute binding, count, and style. Modern alignment relies on text encoders (T5) and contrastive image–text encoders (CLIP, SigLIP) injected via cross-attention into a diffusion or flow backbone, plus classifier-free guidance, RLHF-style preference fine-tuning, and reward models that grade prompt adherence.
- Generative Model Evaluation (FID, IS, and their limits): Fréchet Inception Distance (FID) and Inception Score (IS) are the standard automated metrics for image generative models; both rely on Inception-v3 features and have well-known biases. Modern T2I evaluation supplements them with CLIPScore, prompt-adherence benchmarks (T2I-CompBench, GenEval), human-preference reward models (ImageReward, HPS), and likelihood / NLL where applicable.
- Neural Fields / Implicit Neural Representations: A neural field represents a continuous signal with a neural network that maps coordinates to values such as color, density, or signed distance. This makes the model itself a compact continuous representation of an image, shape, or scene, with NeRF as the best-known example.