Tag: advanced
248 topic(s)
- Multimodal Secure Alignment: Multimodal secure alignment is the problem of making a model's safety behavior consistent across text, images, audio, and mixed-modal inputs. It matters because a model can reconstruct harmful intent across modalities or through images that evade text-only filters, so defenses must align the fused system rather than just one input channel.
- Constitutional Classifiers++: Constitutional Classifiers++ is a production-oriented jailbreak defense that uses context-aware classifiers and a cascade of cheap and expensive checks to block harmful exchanges efficiently. The system is designed to keep refusal rates and serving cost low while still catching universal jailbreaks that earlier, response-only filters missed.
- Continuous Thought Machines (CTM): Continuous Thought Machines are models that make neural timing and synchronization part of the representation, instead of treating layers as purely instantaneous mappings. They use neuron-level temporal processing and support adaptive compute, so the same model can stop early on easy inputs or continue reasoning on harder ones.
- Mechanistic OOCR Steering Vectors: Mechanistic OOCR steering vectors are a proposed explanation for some out-of-context reasoning results: fine-tuning can act like adding an approximately constant steering direction to the residual stream, rather than learning a deeply conditional new algorithm. That helps explain why a tuned behavior can generalize far beyond the fine-tuning data and why injecting or subtracting the vector can often reproduce or remove it.
- Critical Representation Fine-Tuning (CRFT): Critical Representation Fine-Tuning (CRFT) is a PEFT method that improves reasoning by editing a small set of causally important hidden states instead of updating model weights broadly. It identifies critical representations through information-flow analysis and learns low-rank interventions on those states while keeping the base model frozen.
- Chain-of-Thought Monitorability: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- ZeRO (Zero Redundancy Optimizer): ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and eventually parameters across data-parallel workers so each GPU no longer stores a full copy of the training state. This cuts memory dramatically and makes very large-model training feasible without requiring full model-parallel architectures.
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- Sparsely-Gated Mixture of Experts (MoE): A sparsely-gated Mixture of Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic.
- Neural Turing Machine (NTM): A Neural Turing Machine augments a neural controller with a differentiable external memory that it can read from and write to using soft attention over memory locations. It was an early attempt to learn algorithm-like behavior such as copying and sorting while remaining trainable end to end.
- PagedAttention: PagedAttention stores the KV cache in fixed-size non-contiguous blocks, like virtual-memory pages, instead of requiring one contiguous allocation per sequence. This largely removes fragmentation, enables prompt-prefix sharing, and is a key reason vLLM can serve many more concurrent requests.
- Speculative Decoding: Speculative decoding speeds up autoregressive generation by letting a small draft model propose several tokens and then having the large target model verify them in parallel. With the rejection-sampling correction from the original algorithm, the output distribution remains exactly the same as sampling from the target model alone.
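The verification rule above fits in a few lines. A minimal NumPy sketch with toy 4-token distributions (the function name and all values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_token(p_target, p_draft, token, rng):
    """Accept a drafted token with probability min(1, p_target/p_draft);
    on rejection, resample from the normalized residual
    max(p_target - p_draft, 0). This correction keeps the output
    distribution identical to sampling from the target model alone."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return int(token)
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))

p_draft = np.array([0.7, 0.1, 0.1, 0.1])   # draft model's distribution
p_target = np.array([0.4, 0.3, 0.2, 0.1])  # target model's distribution
out = accept_token(p_target, p_draft, token=0, rng=rng)
```

In a full decoder this test runs once per drafted position, and all accepted tokens are committed in a single target-model forward pass.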
- KL-Divergence Penalty in RLHF: The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the objective. This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
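One common shaping of the per-response objective subtracts beta times a sample-based KL estimate built from per-token log-ratios. A minimal sketch (NumPy; function name and values are illustrative):

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Subtract beta * (sum of per-token log-ratios) from the scalar
    reward; the summed log-ratio is a sample-based estimate of the KL
    between the policy and the frozen reference model."""
    kl_estimate = (logp_policy - logp_ref).sum()
    return reward - beta * kl_estimate

shaped = kl_penalized_reward(
    reward=1.0,
    logp_policy=np.array([-0.5, -0.7]),  # per-token log-probs, policy
    logp_ref=np.array([-0.6, -0.9]),     # per-token log-probs, reference
    beta=0.1,
)
```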
- Proximal Policy Optimization (PPO): Proximal Policy Optimization is a policy-gradient algorithm that improves a policy while clipping how far action probabilities can move from the previous policy in one update. In RLHF it is usually paired with a KL penalty so the model gains reward without drifting too far from a reference model.
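The clipping mechanism is easiest to see in the surrogate objective itself. A minimal NumPy sketch (names and values invented for the example):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: per sample, take the minimum of the plain
    ratio * advantage term and the version with the probability ratio
    clipped to [1 - eps, 1 + eps], then average. The min makes large
    favorable ratio moves give no extra credit."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

obj = ppo_clip_objective(
    logp_new=np.array([-0.2, -1.0]),
    logp_old=np.array([-0.5, -0.9]),
    advantage=np.array([1.0, -1.0]),
)
```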
- xLSTM: xLSTM is a family of modern LSTM variants that adds exponential gating and redesigned memory structures, including scalar-memory and matrix-memory forms, to make recurrent models more scalable. The goal is to keep LSTM-style recurrence while improving stability, parallelism, and long-context performance.
- minLSTM: minLSTM is a simplified LSTM variant designed to remove some of the sequential dependencies that make classical LSTMs expensive while keeping useful gating behavior. The result is a lighter recurrent block that can be trained more efficiently and scaled more easily.
- Grouped-Query Attention (GQA): Grouped-query attention shares key and value heads across groups of query heads, reducing KV-cache size and bandwidth during inference. It sits between full multi-head attention and multi-query attention, preserving most quality while making long-context serving cheaper.
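The sharing pattern can be sketched by broadcasting each KV head to the query heads in its group (toy NumPy example; all shapes are illustrative):

```python
import numpy as np

# GQA sketch: 8 query heads share 2 KV heads (groups of 4).
n_q_heads, n_kv_heads, seq, d = 8, 2, 5, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d))
k = rng.standard_normal((n_kv_heads, seq, d))   # KV cache is 4x smaller
v = rng.standard_normal((n_kv_heads, seq, d))

# Broadcast each KV head to the query heads in its group.
group = n_q_heads // n_kv_heads
k_shared = np.repeat(k, group, axis=0)
v_shared = np.repeat(v, group, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(d)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)       # softmax over keys
out = weights @ v_shared   # (n_q_heads, seq, d); only 2 KV heads stored
```

Setting n_kv_heads equal to n_q_heads recovers multi-head attention; setting it to 1 recovers multi-query attention.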
- Tree of Thought: Tree of Thought extends chain-of-thought by exploring multiple candidate reasoning paths, evaluating intermediate states, and searching over them with strategies such as BFS or DFS. It is useful when solving the task requires branching, backtracking, or comparing alternative partial plans.
- QLoRA (Quantized LoRA): QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
- Misalignment: Misalignment is the failure mode where optimizing a model for its training objective or proxy reward does not produce the behavior humans actually want. It includes problems like reward hacking, unsafe shortcuts, and goal pursuit that diverges from the intended specification.
- Sparse Mixture-of-Experts (MoE) Layer: A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset such as top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
- Router Network: A router network scores experts or computation paths for each token and decides where that token should be sent in a conditional-compute model such as an MoE. A good router improves specialization while avoiding collapsed routing, overload, and excessive communication.
- Expert Network: An expert network is one of the specialized submodules inside an MoE layer that processes only the tokens routed to it. Experts usually share the same architecture but learn different functions, so specialization emerges from routing plus load-balancing constraints.
- Top-k Routing: Top-k routing sends each token only to the k highest-scoring experts instead of to every expert. This makes MoE computation sparse and efficient, but the choice of k trades off compute cost, robustness, and routing stability.
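The selection step can be sketched as picking the k best router scores per token and renormalizing their gates (NumPy; the function name and scores are invented):

```python
import numpy as np

def top_k_route(logits, k=2):
    """Return, per token, the indices of the k best-scoring experts and
    their gate weights (softmax over just the selected logits)."""
    top = np.argsort(logits, axis=-1)[:, -k:]
    picked = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(picked - picked.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    return top, gates

logits = np.array([[0.1, 2.0, -1.0, 0.5],   # token 0's router scores
                   [1.5, 0.2,  1.4, 0.0]])  # token 1's router scores
experts, gates = top_k_route(logits, k=2)
```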
- Load Balancing (MoE): Load balancing in MoE training adds losses or routing constraints so tokens are spread across experts instead of collapsing onto a few popular ones. It matters because uneven routing wastes capacity, creates bottlenecks, and leaves underused experts poorly trained.
- Switch Transformer: Switch Transformer is a simplified MoE Transformer that routes each token to exactly one expert in each sparse feed-forward layer. Top-1 routing reduces communication and implementation complexity, enabling very large sparse models, but makes router stability and load balancing especially important.
- Preference-Based Alignment: Preference-based alignment trains models from judgments such as ‘response A is better than response B’ instead of only from supervised targets. It is useful when desired behavior is easier for humans to compare than to specify as a single correct answer.
- Reinforcement Learning from Human Feedback (RLHF): RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize reward while staying close to a reference model. It improved helpfulness and instruction following, but it can also create reward hacking and training instability.
- Constitutional AI: Constitutional AI aligns a model using an explicit list of principles that guide critique and revision, reducing the need for dense human feedback on every example. The constitution acts like a rule set for self-improvement, though the resulting behavior still depends on the chosen principles and training procedure.
- Direct Preference Optimization (DPO): DPO learns directly from preference pairs by making chosen responses more likely than rejected ones without running a separate RL loop. It can be derived from a KL-constrained reward-maximization view, which is why it is often presented as a simpler alternative to PPO-based RLHF.
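For one preference pair, the DPO loss is the negative log-sigmoid of beta times the difference of policy-versus-reference log-ratios. A minimal sketch (NumPy; log-prob values are invented):

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO for one preference pair: -log sigmoid(beta * margin), where
    margin = (policy log-ratio of chosen) - (policy log-ratio of rejected),
    each measured against the frozen reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return np.log1p(np.exp(-beta * margin))  # = -log(sigmoid(beta * margin))

loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.5)
```

Gradient descent on this loss raises the chosen response's likelihood relative to the rejected one, with the reference log-probs anchoring the update.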
- Vision Language Model (VLM): A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- FlashAttention: FlashAttention is an exact attention algorithm that uses tiling and kernel fusion to minimize reads and writes between GPU HBM and on-chip SRAM. It preserves standard attention outputs while greatly reducing memory traffic, which yields large speed and memory gains on long sequences.
- Pipeline Parallelism: Pipeline parallelism partitions a model by layers across devices and sends microbatches through the partitions like an assembly line. It reduces per-device memory, but pipeline bubbles and stage imbalance can waste throughput if the schedule is poorly tuned.
- Tensor Parallelism: Tensor parallelism shards individual large matrix operations across devices, such as splitting weight matrices by rows or columns. It is effective for very large Transformers, but the frequent collectives mean fast interconnects are important.
- Context Parallelism: Context parallelism distributes a long sequence across devices so context tokens and their attention-related work are sharded instead of fully replicated. It helps long-context training or inference scale beyond one device, but requires extra communication to preserve exact attention across chunks.
- Fully Sharded Data Parallel (FSDP): Fully Sharded Data Parallel shards model parameters, gradients, and optimizer states across data-parallel workers, gathering full parameters only when needed for computation. It is the PyTorch analogue of ZeRO-style training and makes much larger models fit without custom model-parallel code.
- Long-Context Pretraining: Long-context pretraining trains or continues training a model on examples with much longer sequences so it learns to use distant context instead of only fitting short windows. It is usually needed because simply changing positional scaling or the context limit does not teach robust long-range retrieval or reasoning.
- Grounding: Grounding means tying a model’s answer to external evidence, inputs, or world state rather than letting it generate from unsupported priors alone. In RAG or tool-use systems, grounding is what makes outputs traceable to retrieved context or observations.
- Faithfulness: Faithfulness is whether a model’s output is supported by the provided input, source document, or chain of evidence. It differs from factuality because a summary can be perfectly faithful to a source that contains false claims.
- Inference Optimization: Inference optimization is the set of techniques that reduce serving latency, memory use, and cost while preserving acceptable quality. Common methods include quantization, batching, KV-cache optimizations, kernel fusion, speculative decoding, and architecture choices that trade a little flexibility for much higher throughput.
- Safety Alignment: Safety alignment is the process of making a model reliably avoid harmful, deceptive, or policy-violating behavior while remaining useful. In practice it combines data curation, supervised tuning, preference optimization or RLHF, classifiers, and adversarial evaluation, but it never guarantees perfect safety.
- What is a jailbreak in the context of LLMs? In the context of LLMs, a jailbreak is a prompt or interaction pattern that bypasses the model’s safety training or policy enforcement and elicits behavior it was supposed to refuse. Jailbreaks matter because they reveal that aligned behavior can be a thin behavioral layer rather than a deep guarantee.
- Adversarial Prompting: Adversarial prompting is the deliberate construction of inputs that push a model toward incorrect, unsafe, or unintended behavior. It includes jailbreaks, prompt injection, data exfiltration attempts, and other attacks that exploit weaknesses in instruction-following or context handling.
- Mechanistic Interpretability: Mechanistic interpretability treats a neural network as a system to be reverse-engineered into circuits, features, and algorithms. Its goal is not just to correlate neurons with concepts, but to identify the actual internal computations that produce behavior.
- Logit Lens: Logit Lens maps intermediate hidden states through the final unembedding matrix to inspect what tokens each layer already appears to favor. It is a convenient way to watch a Transformer’s computation unfold, though it is only approximate because earlier layers were not trained to be decoded directly.
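The mechanic itself is one matrix product. A toy sketch with random stand-ins for the hidden state and unembedding matrix (in a real analysis both come from an actual Transformer checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 100

hidden = rng.standard_normal(d_model)        # stand-in mid-layer state
W_U = rng.standard_normal((d_model, vocab))  # stand-in unembedding matrix

# Logit lens: decode the intermediate state as if it were the final one.
lens_logits = hidden @ W_U
favored = np.argsort(lens_logits)[-5:][::-1]  # top-5 tokens at this layer
```

Running this at every layer and tracking how `favored` changes is the usual way the lens is applied in practice.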
- Sparse Autoencoder (Mechanistic Interpretability): In mechanistic interpretability, a sparse autoencoder is trained on model activations to decompose dense, superposed representations into a larger set of sparse features. This often makes latent structure more interpretable, because individual learned directions can line up with human-readable concepts or behaviors.
- Superposition (Neural Networks): Superposition is the phenomenon in which a network stores more features than it has obvious dimensions by packing them into overlapping directions. It explains why single neurons can look polysemantic and why sparse feature dictionaries are often more informative than neuron-by-neuron inspection.
- Logit Adjustment: Logit adjustment means modifying logits to account for effects such as class imbalance, prior shift, or calibration goals before taking probabilities or losses. It changes the decision boundary in a simple way by shifting scores rather than changing the underlying representation.
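The post-hoc variant for class imbalance is a one-line shift by the log prior. A minimal sketch (NumPy; the function name and numbers are illustrative):

```python
import numpy as np

def adjust_logits(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) so the
    decision rule stops defaulting to frequent classes."""
    return logits - tau * np.log(class_prior)

logits = np.array([2.0, 2.0])     # model scores the two classes equally
prior = np.array([0.9, 0.1])      # but class 0 is 9x more common
adjusted = adjust_logits(logits, prior)
pred = int(np.argmax(adjusted))   # the rare class now wins the tie
```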
- Distributed Computing (ML Training): Distributed computing in ML training spreads computation, memory, or both across many devices and often many machines. It is what makes modern large-model training possible through strategies such as data parallelism, model parallelism, sharding, and pipeline execution.
- GRPO (Group Relative Policy Optimization): GRPO is a policy-optimization method that scores sampled responses relative to others in the same group, using those relative rewards to update the policy. Its appeal is that it can improve reasoning performance while avoiding some of the memory overhead of PPO-style critic training.
- Activation Patching: Activation patching is a causal analysis method where activations from one run are inserted into another to test which components matter for a given behavior. If patching a layer or head restores the behavior, that component is evidence for being on the relevant causal path.
- Kolmogorov-Arnold Networks: Kolmogorov-Arnold Networks replace fixed scalar weights on edges with learnable one-dimensional functions, so layers are built from sums of learned univariate transforms rather than simple affine maps. They are motivated by the Kolmogorov-Arnold representation theorem and are often discussed as a more interpretable alternative to MLPs, not a universal replacement.
- Double Descent: Double descent is the phenomenon in which test error first follows the classical U-shape with increasing model size, then improves again once the model passes the interpolation threshold. It matters because it shows that the old bias-variance story is incomplete in highly overparameterized regimes.
- Grokking: Grokking is a delayed generalization phenomenon in which a model first memorizes the training set and only much later snaps into a simple algorithm that generalizes well. It is interesting because the model already had enough capacity to fit the data, yet the more general solution emerged only after long training and regularization pressure.
- Neural Tangent Kernel (NTK): The Neural Tangent Kernel is the kernel that describes how an infinitely wide network trained by small gradient steps evolves around its initialization. In that limit, training becomes equivalent to kernel regression, which explains part of the behavior of very wide networks.
- State Space Models / Mamba: State space models such as Mamba process sequences by evolving a learned hidden state through recurrence rather than full quadratic attention. Their main appeal is linear-time sequence processing with strong long-context efficiency, especially when selective state updates let the model decide what to remember.
- Linear Attention: Linear attention is the family of attention mechanisms that rewrites or approximates softmax attention so sequence processing scales roughly linearly instead of quadratically with length. The benefit is efficiency on long contexts, but the tradeoff is that exact softmax behavior is usually lost.
- ALiBi (Attention with Linear Biases): ALiBi is a positional method that adds head-specific linear distance penalties directly to attention logits instead of injecting separate position embeddings. Because the bias is built into the score function, models trained with ALiBi often extrapolate to longer contexts better than models tied to a fixed embedding table.
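The bias tensor is cheap to construct. A sketch assuming the paper's geometric slope schedule for power-of-two head counts (for 8 heads: 1/2, 1/4, ..., 1/256):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """Per-head linear distance penalties added to attention logits.
    Positions j > i are left at 0 because the causal mask removes
    them anyway; the penalty grows linearly with distance i - j."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    penalty = np.where(j <= i, -(i - j).astype(float), 0.0)
    return slopes[:, None, None] * penalty   # (n_heads, seq, seq)

bias = alibi_bias(n_heads=8, seq_len=4)
```

Each head sees a different slope, so some heads stay near-local while others keep a longer effective range.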
- YaRN / NTK-aware RoPE Scaling: YaRN and other NTK-aware RoPE-scaling methods extend the usable context of RoPE-based models by rescaling or interpolating rotary frequencies rather than retraining the model from scratch. Their goal is to preserve short-context behavior while making long-range positions less distorted.
- Sliding-Window Attention: Sliding-window attention restricts each token to attending only within a local context window rather than the entire sequence. This reduces compute and memory from full-context attention and is effective when most useful dependencies are nearby, though it can miss long-range interactions unless combined with global mechanisms.
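The attention pattern reduces to a banded causal mask. A minimal sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean causal mask: token i may attend to token j only when
    i - window < j <= i, i.e. at most `window` positions ending at i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
```

Stacking layers widens the effective receptive field, since information can hop one window per layer.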
- Multi-head Latent Attention (MLA): Multi-head Latent Attention compresses the keys and values of multi-head attention into a smaller latent representation before use. Its main advantage is a much smaller KV cache and lower decode-time memory bandwidth, which is why it is attractive for long-context serving.
- Chinchilla Scaling Laws: Chinchilla scaling laws showed that many large language models were undertrained for their size under fixed compute budgets. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, yielding better compute-optimal performance.
- Auxiliary Load-Balancing Loss (MoE): The auxiliary load-balancing loss in a Mixture-of-Experts model encourages the router to spread tokens more evenly across experts. Without it, routing often collapses onto a few experts, which wastes capacity and creates severe hot spots in both learning and systems performance.
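A Switch-Transformer-style version of this loss can be sketched as the dot product of dispatch fractions and mean router probabilities (NumPy; names and toy routing are illustrative):

```python
import numpy as np

def aux_load_balance_loss(router_probs, expert_index, n_experts):
    """Switch-style auxiliary loss: n_experts * dot(f, P), where f is
    the fraction of tokens dispatched to each expert and P is the mean
    router probability per expert. The value is minimized (at 1.0)
    when both are uniform across experts."""
    frac = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    mean_prob = router_probs.mean(axis=0)
    return n_experts * float(frac @ mean_prob)

# Perfectly balanced toy routing: uniform probs, round-robin dispatch.
probs = np.full((8, 4), 0.25)
dispatch = np.arange(8) % 4
loss = aux_load_balance_loss(probs, dispatch, n_experts=4)
```

Any collapse toward a subset of experts raises both `frac` and `mean_prob` on the same entries, pushing the loss above its uniform minimum.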
- Muon Optimizer: Muon is an optimizer designed especially for matrix-valued parameters that replaces the raw update direction with an orthogonalized one. The point is to respect matrix structure rather than treating every weight tensor as a flattened vector, with the goal of improving training efficiency relative to standard first-order optimizers.
- Generalized Advantage Estimation (GAE): Generalized Advantage Estimation is a family of advantage estimators that interpolates between low-variance temporal-difference updates and high-variance Monte Carlo returns using a parameter lambda. It is widely used because it gives a practical bias-variance tradeoff for policy-gradient training.
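The estimator is a backward scan over TD errors. A minimal sketch (NumPy; the reward and value numbers are invented):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via a backward scan over TD
    errors. lam=0 recovers one-step TD advantages; lam=1 recovers
    Monte Carlo returns minus the baseline. `values` carries one extra
    entry: the bootstrap value after the last step."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

adv = gae(rewards=np.array([1.0, 0.0, 1.0]),
          values=np.array([0.5, 0.4, 0.3, 0.0]))
```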
- Process Reward Models (PRM) vs Outcome Reward Models (ORM): Outcome reward models score only the final answer, while process reward models score the intermediate steps of a solution. PRMs provide denser supervision and better guidance for search and long-form reasoning, but they require more fine-grained labels and more complex evaluation.
- KTO (Kahneman–Tversky Optimization): KTO is a preference-optimization objective that learns from binary desirable-versus-undesirable labels instead of pairwise rankings. It uses a utility formulation inspired by prospect theory, making it a cheaper alternative when collecting full preference comparisons is too expensive.
- RLAIF (RL from AI Feedback): RLAIF replaces human preference labels with judgments produced by another AI model following a rubric. It scales alignment data collection much more cheaply than RLHF, but it also transfers the biases and blind spots of the judge model into the training signal.
- Reasoning Models (o1 / R1-style Long-CoT): Reasoning models in the o1 or R1 style are language models trained or prompted to spend extra inference compute on long multi-step reasoning before answering. Their key idea is that better reasoning can come not only from bigger models, but from better search, verification, and credit assignment at inference and post-training time.
- Process Supervision: Process supervision trains a model on the quality of intermediate reasoning steps rather than only on whether the final answer is correct. It improves credit assignment for long solutions and makes verification more local, though collecting reliable step-level labels is expensive.
- Monte Carlo Tree Search for LLM Reasoning: Monte Carlo Tree Search for LLM reasoning treats partial solution paths as tree nodes, expands candidate continuations, and uses rollouts or value estimates to decide where to search next. It is attractive because it turns one-shot generation into guided search over reasoning trajectories instead of committing immediately to a single chain of thought.
- vLLM & Continuous Batching: vLLM is an LLM serving system built around PagedAttention and continuous batching. Instead of waiting for a batch to finish, it admits and schedules requests at each decoding step, which reduces padding waste and improves throughput for variable-length generations.
- GPTQ Quantization: A one-shot, layer-wise post-training quantization method that pushes LLM weights to 3–4 bits while preserving generation quality. GPTQ reformulates quantization as a per-row error-minimization problem solved greedily using the inverse Hessian of a small calibration set, achieving 3-bit LLaMA-65B with <1% perplexity loss.
- AWQ (Activation-aware Weight Quantization): AWQ is a post-training quantization method that preserves quality by protecting the weights attached to the largest activation channels before rounding. It targets low-bit LLM inference with small accuracy loss and is popular because it is simpler to deploy than Hessian-based methods such as GPTQ.
- Attention Sinks / StreamingLLM: Attention sinks are the first few tokens in a causal Transformer that absorb disproportionate attention from later positions, even when they carry little semantic content. StreamingLLM exploits this by keeping sink tokens and a short recent window in the KV cache, enabling long streaming inference with bounded memory.
- Induction Heads: A specific two-head circuit in Transformer attention that copies the next token after a previous occurrence of the current token — the computational basis for in-context learning. Anthropic showed induction heads form suddenly during training, coinciding with the sharp jump in ICL ability.
- Circuit Analysis: The mechanistic-interpretability practice of identifying subgraphs of weights, residual-stream components, and attention heads that jointly implement a human-interpretable algorithm (indirect object identification, modular addition, greater-than). Circuit analysis produces falsifiable, causal accounts of what a network has learned.
- ROME / MEMIT Model Editing: Rank-one edits to MLP weights that inject a single fact (ROME) or thousands of facts (MEMIT) into a pretrained LLM without retraining. They exploit the observation that MLP blocks act as key–value memories, identify the causal neurons via activation patching, and solve a closed-form optimisation problem for the minimal-norm weight update.
- Vision Transformer (ViT): Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern vision-language models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work.
- Masked Autoencoder (MAE): A self-supervised ViT pretraining objective: randomly mask 75% of image patches and train an asymmetric encoder–decoder to reconstruct pixel values from the visible 25%. MAE is simple, compute-efficient (the encoder sees only unmasked patches), and produces state-of-the-art ImageNet fine-tuning representations.
- DINOv2: A self-supervised ViT pretraining recipe from Meta (Oquab et al., 2023) that combines a DINO-style self-distillation objective with an iBOT masked-patch prediction objective and a curated 142M-image dataset. DINOv2 produces general-purpose frozen visual features that outperform task-specific supervised baselines on classification, segmentation, depth, and correspondence.
- SigLIP: A contrastive image–text pretraining method (Zhai et al., 2023) that replaces CLIP's softmax-over-batch contrastive loss with a pairwise sigmoid binary cross-entropy. SigLIP removes the need for large global batches, scales batch-size-efficiently, and achieves CLIP-level or better zero-shot accuracy at a fraction of the training compute.
- Denoising Diffusion Probabilistic Models (DDPM): A generative model that learns to reverse a fixed Gaussian corruption process. Ho et al. (2020) showed that predicting the added noise with a neural network, trained by a simple MSE loss on \( T \) diffusion steps, yields state-of-the-art image synthesis — the foundation of all modern image/video diffusion.
- Classifier-Free Guidance: Classifier-free guidance is a sampling trick for conditional diffusion models that combines conditional and unconditional predictions to push samples harder toward the prompt. It improves prompt adherence without a separate classifier, but too much guidance can oversaturate images and reduce diversity.
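The combination step is a single extrapolation per denoising step. A minimal sketch (NumPy; toy noise predictions, names illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. A scale of 1.0 recovers the
    plain conditional prediction; larger values push harder toward
    the prompt at the cost of diversity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.2])   # toy unconditional noise prediction
eps_c = np.array([0.3,  0.0])   # toy conditional noise prediction
guided = cfg_combine(eps_u, eps_c, guidance_scale=2.0)
```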
- Flow Matching / Rectified Flow: A generative modelling framework that learns a time-dependent velocity field mapping noise to data along a fixed probability path. Rectified Flow in particular learns straight-line paths between noise and data samples, enabling 1–4 step sampling with quality matching much deeper diffusion models.
- Latent Diffusion: Run the diffusion process in the compressed latent space of a pretrained VAE rather than in pixel space. Latent diffusion (Rombach et al., 2022) slashes memory and compute by working on latents downsampled roughly 4–8× per spatial dimension while preserving sample quality, and is the architecture behind Stable Diffusion, SDXL, SD3, and most text-to-image systems.
- Evidence Lower Bound (ELBO) — Derivation: Two derivations of the ELBO: (i) Jensen's inequality applied to \( \log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z} \), and (ii) the identity \( \log p(\mathbf{x}) = \text{ELBO}(q) + D_{\text{KL}}(q \| p(\cdot \mid \mathbf{x})) \). Both produce the same bound, but the second makes the gap explicit.
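Spelling out derivation (ii): for any distribution \( q(\mathbf{z}) \),
\[ \log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q(\mathbf{z})}\!\left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right]}_{\text{ELBO}(q)} + D_{\text{KL}}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big). \]
Because the KL term is non-negative, the ELBO lower-bounds \( \log p(\mathbf{x}) \), with equality exactly when \( q \) equals the true posterior \( p(\mathbf{z} \mid \mathbf{x}) \); the KL term is precisely the gap that derivation (i) leaves implicit.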
- Score Matching: An estimation principle (Hyvärinen, 2005) that fits an unnormalised density by matching the model's score \( \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) to the data's score. Integration-by-parts eliminates the unknown data-score, yielding a tractable objective that underlies modern score-based diffusion models.
- Denoising Score Matching: Vincent (2011) showed that the score of a Gaussian-corrupted data distribution \( q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \) admits a closed-form target, reducing score learning to a simple regression: predict \( (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). This identity is the algorithmic heart of modern diffusion models.
- DDIM (Deterministic Diffusion Sampling): Song, Meng & Ermon (2020) introduced a non-Markovian sampler for DDPM-trained diffusion models that generates samples in a fraction of the steps while matching quality. DDIM's \( \eta = 0 \) limit is a deterministic ODE integrator, enabling latent interpolation and invertibility.
- Consistency Models: Song et al. (2023) train a neural network \( f_\theta(\mathbf{x}_t, t) \) whose output is consistent along the probability-flow ODE trajectories of a diffusion model, so that \( f_\theta(\mathbf{x}_t, t) \approx \mathbf{x}_0 \) for every \( t \). This collapses diffusion sampling to a single step, with optional multi-step refinement for quality.
- Diffusion Transformers (DiT): Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer over VAE-latent patches. DiT scales predictably with compute, matches or exceeds U-Net quality, and is the architectural backbone of Stable Diffusion 3, Sora, and most frontier text-to-image/video diffusion models.
- Stable Diffusion Pipeline: A text-to-image pipeline composed of (i) a VAE that compresses pixels to a 64×-smaller latent, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. All three pretrained components are glued by classifier-free guidance at inference.
- BYOL / Self-Distillation: Bootstrap Your Own Latent (Grill et al., 2020) shows that strong visual representations can be learned without negatives: an online network predicts the output of a momentum-updated target network on a different augmentation of the same image. The same template underlies DINO, MoCo-v3, and other non-contrastive SSL methods.
- JEPA (Joint Embedding Predictive Architecture)LeCun's self-supervised template (2022) that predicts the representation of a target from the representation of a context, rather than predicting the target itself. By regressing in embedding space, JEPA avoids wasting capacity on irrelevant per-pixel detail; I-JEPA and V-JEPA are concrete instantiations for images and video.
- Vision-Language Contrastive Objectives Beyond CLIPSuccessors to CLIP refine its symmetric InfoNCE loss for better efficiency, finer-grained alignment, and scaling. SigLIP replaces softmax with a pairwise sigmoid; LiT freezes a pretrained image tower; ALIGN scales noisily; FILIP does token-level contrast; CoCa adds a captioning head. All share the joint-embedding template.
- World ModelA learned predictive model of environment dynamics — given a state and action, predict the next state (and reward). World models enable planning, offline rollouts, and sample-efficient RL. Recent systems (Dreamer, MuZero, Genie) show that powerful latent world models scale to games, robotics, and interactive video generation.
- Encodec / Neural Audio CodecsMeta's Encodec (2022) is a neural audio codec that compresses audio to discrete tokens via residual vector quantisation (RVQ) and reconstructs it with a neural decoder. Encodec is the tokeniser of choice for generative audio models such as MusicGen and VALL-E (AudioLM uses the closely related SoundStream codec), bridging continuous audio and LLM-style discrete modelling.
- RetNet / Retention NetworksSun et al. (2023) introduce a Transformer-alternative block whose retention operator admits three equivalent forms: parallel (for training), recurrent (for \( O(1) \) inference per token), and chunkwise-recurrent (for long-sequence training). RetNet aims for RNN-like inference cost with Transformer-like parallelisable training.
- RWKVAn RNN-Transformer hybrid (Peng et al., 2023) whose block is a parallelisable linear-attention operation at training time and a simple recurrent state update at inference time. RWKV scales to 14B+ parameters with Transformer-competitive perplexity, offering constant-memory inference.
- Hyena / Long ConvolutionsPoli et al. (2023) propose replacing attention with a data-controlled long-range convolution: a filter parameterised implicitly by an MLP-of-positions, applied via FFT for \( O(n \log n) \) cost. Hyena approaches Transformer quality on pretraining perplexity at a fraction of the compute.
- xFormers / Memory-Efficient AttentionA library / pattern of attention implementations that avoid materialising the \( n \times n \) attention matrix, reducing memory from \( O(n^2) \) to \( O(n) \). xFormers bundles FlashAttention, Memory-Efficient Attention (Rabe & Staats), block-sparse variants, and ALiBi/RoPE patches under a unified API — a precursor to the default attention kernels shipped in PyTorch 2.0+.
- KV Cache Compression (H2O, SnapKV)Inference-time methods that shrink a long-context KV cache by evicting tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot pre-fill-time compression.
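A toy numpy sketch of the H2O-style scoring rule described above (hypothetical shapes; real systems score per attention head and evict online during decoding):

```python
import numpy as np

def h2o_keep(attn, keep):
    """Score each past token by cumulative attention received (H2O's
    'heavy hitter' heuristic) and keep the top-`keep` KV entries."""
    scores = attn.sum(axis=0)                    # (n_keys,) cumulative mass
    return np.sort(np.argsort(scores)[-keep:])   # kept indices, in order

rng = np.random.default_rng(0)
attn = rng.random((8, 16))
attn /= attn.sum(axis=1, keepdims=True)          # rows: attention over 16 keys
kept = h2o_keep(attn, keep=4)
dropped = np.setdiff1d(np.arange(16), kept)
scores = attn.sum(axis=0)
```

Every kept token has at least as much cumulative attention as every dropped one, which is exactly the eviction invariant.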
- Chunked PrefillA serving-time technique that breaks the long prefill of a prompt into small chunks and interleaves them with decode steps of other requests. By keeping GPU utilisation high during prefill and avoiding long tail latencies, chunked prefill dramatically improves throughput in mixed-batch LLM serving.
- Disaggregated Prefill/Decode ServingDisaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
- Paged vs Block KV CacheTwo allocation strategies for an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case length per request and wastes memory. Paged (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× higher batch-size at the cost of kernel-level bookkeeping.
- Tensor Cores & GEMM FundamentalsTensor cores are specialised matrix-multiply units in NVIDIA GPUs (introduced in Volta, 2017) that execute small mixed-precision matrix-multiply-accumulate (MMA) ops per clock. Peak DL throughput is set by tensor-core flops; general matrix multiplies (GEMM) tiled to tensor-core shape are how deep learning touches that ceiling.
- BF16 / FP8 / MXFP4 Number FormatsThese are low-precision number formats used to trade numerical precision for speed and memory efficiency in modern ML hardware. BF16 is the standard training workhorse, FP8 is increasingly used for faster training and inference, and 4-bit floating formats push efficiency further for aggressive inference optimization.
- DeepSpeed ZeRO-Infinity / OffloadingAn extension of ZeRO that offloads optimizer states, gradients, and parameters to CPU RAM and NVMe SSDs, enabling training of trillion-parameter models on modest GPU clusters. ZeRO-Infinity (Rajbhandari et al., 2021) uses bandwidth-aware partitioning and overlap to hide the offload latency.
- Ring Attention / Context Parallel for Long SequencesA distributed-attention algorithm that shards an \( n \)-token sequence across \( P \) devices and computes each attention output via a ring of key-value rotations. Ring Attention (Liu et al., 2023) enables context lengths of millions of tokens on multi-GPU clusters with near-linear scaling.
- Self-Play Fine-Tuning (SPIN)Chen et al. (2024) cast fine-tuning as a game between the current model and its previous iterate: the new model must distinguish its own generations from demonstrations, improving it without any new human data. SPIN delivers substantial improvements on supervised data alone, bootstrapping from a weak SFT model toward a stronger one.
- Iterative DPOIterative DPO repeats a simple loop: sample responses from the current policy, score or label them, apply a DPO-style update, and repeat. It brings online data collection to preference optimization while keeping DPO's simpler training dynamics compared with PPO-based RLHF.
- Weak-to-Strong GeneralizationOpenAI's analog for scalable oversight (Burns et al., 2023): can a strong model, fine-tuned on labels from a weaker supervisor, generalise beyond the supervisor's capability? Experiments on NLP and chess tasks show it partially can; the residual quality gap motivates future work on supervising superhuman models.
- Debate and Amplification (Scalable Oversight)Two proposed scalable-oversight protocols: debate pits two models against each other before a human judge; amplification recursively decomposes a difficult question into easier sub-questions that the supervisor can answer. Both aim to let weaker supervisors correctly oversee stronger AI by exploiting zero-sum pressure or composition.
- Scalable Oversight and HonestyThe alignment sub-problem of producing reliable supervision for AIs that exceed their supervisors' capability. Proposals span protocols (debate, amplification, recursive reward modelling), training signals (constitutional AI, process-reward models), and the narrower but more tractable goal of honesty : aligning the model's outputs with its internal beliefs.
- Model Organisms of MisalignmentAn Anthropic research programme (Hubinger et al., 2023): deliberately construct AI systems that exhibit specific, well-defined misalignment (e.g., deceptive reasoning, sandbagging, reward-hacking) to study the dynamics, detection, and removal of such behaviour — analogous to model organisms (mice, yeast) in biology.
- Deceptive AlignmentA hypothesised failure mode (Hubinger et al., 2019) in which an AI deliberately behaves well during training and evaluation but pursues different goals at deployment. A deceptively aligned model is instrumentally aligned during oversight and misaligned otherwise, making detection exceptionally hard.
- Sleeper AgentsHubinger et al. (2024) demonstrate that LLMs can be deliberately trained to have a backdoor — e.g., produce insecure code when a trigger string appears — and that standard safety training (RLHF, adversarial training, SFT on harmlessness) fails to remove the backdoor. Evidence that once deceptive alignment is present, it may persist through the standard post-training pipeline.
- Monosemanticity and SAE FeaturesMonosemanticity is the idea that a single learned feature corresponds to a single interpretable concept rather than many unrelated ones. Sparse autoencoders are used to decompose dense neural activations into a wider sparse basis where such features are easier to identify.
- Agent Benchmarks (SWE-bench, GAIA, WebArena)SWE-bench tests whether a model can fix real GitHub issues in code repositories, GAIA tests general tool-using problem solving with automatically checked answers, and WebArena tests web-navigation agents in simulated sites. Together they measure software, reasoning, and browser-action competence rather than just one-shot text generation.
- Cramér–Rao Lower BoundThe Cramér–Rao bound states that any unbiased estimator \( \hat\theta \) of a parameter \( \theta \) has variance \( \text{Var}(\hat\theta) \ge 1/I(\theta) \), where \( I(\theta) \) is the Fisher information. It is the foundational efficiency bound of classical statistics and gives an immediate lower bound on the uncertainty of MLE-based procedures.
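A quick numeric check of the bound, assuming the Bernoulli model, where the MLE (the sample mean) attains the Cramér–Rao bound exactly:

```python
import numpy as np

# For Bernoulli(p): the sample mean is unbiased with Var = p(1-p)/n,
# which equals 1/(n * I(p)) — the MLE attains the bound.
rng = np.random.default_rng(0)
p, n, trials = 0.3, 200, 5000
fisher = 1.0 / (p * (1 - p))            # Fisher information of one draw
crlb = 1.0 / (n * fisher)               # bound for n i.i.d. samples
est = rng.binomial(n, p, size=trials) / n
emp_var = est.var()                     # Monte Carlo variance of the MLE
```

The empirical variance of the estimator matches the bound to within Monte Carlo error.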
- Gaussian Process RegressionA Gaussian process defines a distribution over functions such that any finite set of evaluations is jointly Gaussian. Given a kernel \( k \) and noisy observations, the posterior at a test point is \( \mathcal{N}(\mu_*, \sigma_*^2) \) with closed-form mean and variance. GP regression gives both predictions and calibrated uncertainty, costs \( O(n^3) \) for \( n \) training points, and is the Bayesian counterpart of kernel ridge regression.
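A minimal numpy sketch of the closed-form posterior above (toy RBF kernel and data; library implementations use a Cholesky factorisation for numerical stability):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(x, y, xs, noise=1e-2):
    """Closed-form GP posterior: mean Ks K^-1 y, variance Kss - Ks K^-1 Ks^T."""
    K = rbf(x, x) + noise * np.eye(len(x))
    Ks = rbf(xs, x)                              # test-train cross-covariance
    mu = Ks @ np.linalg.solve(K, y)
    var = np.diag(rbf(xs, xs) - Ks @ np.linalg.solve(K, Ks.T))
    return mu, var

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(x)
mu, var = gp_posterior(x, y, np.array([0.0, 3.0]))
```

Near the training data the posterior variance collapses toward the noise level; far away it reverts toward the prior variance — the calibrated-uncertainty behaviour the entry describes.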
- VC Dimension and ShatteringThe Vapnik–Chervonenkis dimension of a hypothesis class \( \mathcal{H} \) is the largest number of points \( \mathcal{H} \) can shatter — label in every possible way. It controls PAC generalisation: if \( \text{VC}(\mathcal{H}) = d \), then with probability \( 1 - \delta \) all \( h \in \mathcal{H} \) satisfy \( L(h) \le \hat L(h) + O(\sqrt{d/n}) \). VC dimension explains why linear classifiers in \( \mathbb{R}^d \) need \( \Omega(d) \) examples and why simple hypothesis classes generalise.
- Rademacher ComplexityThe empirical Rademacher complexity of a function class \( \mathcal{F} \) on data \( S = (z_1, \dots, z_n) \) is \( \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma[\sup_{f \in \mathcal{F}} \tfrac{1}{n}\sum_i \sigma_i f(z_i)] \) — the expected ability of \( \mathcal{F} \) to correlate with random \( \pm 1 \) signs. It is the data-dependent workhorse of modern generalisation bounds, usually tighter than VC, and gives direct norm-based bounds for deep networks.
- No-Free-Lunch TheoremThe No-Free-Lunch theorem says that averaged uniformly over all possible target functions, no learning algorithm outperforms any other. In machine learning it means performance gains must come from inductive bias that matches the structure of the problems we actually care about.
- Lottery Ticket HypothesisA dense randomly-initialised neural network contains subnetworks ("winning tickets") that — when trained in isolation with their original initialisation — match the full network's accuracy in the same number of steps. This Frankle–Carbin observation motivates one-shot and iterative magnitude pruning as search algorithms for sparse trainable subnetworks, reframing pruning as an initialisation search rather than a post-hoc compression.
- Loss Landscape: Flat vs Sharp MinimaFlat minima (low curvature / small Hessian eigenvalues) generalise better than sharp minima (high curvature), empirically and via PAC-Bayes bounds. SGD's noise, large batch sizes, and the Sharpness-Aware Minimisation (SAM) optimiser all interact with this: small-batch SGD prefers flat minima, large-batch SGD falls into sharper ones, and SAM explicitly penalises sharpness during training.
- Implicit Regularisation of SGDOver-parameterised networks trained by SGD generalise despite being able to fit pure noise — SGD's trajectory biases the solution toward specific minima. For linear models, gradient flow converges to the minimum-norm interpolator; for deep nets, SGD with small LR and moderate batch behaves like Bayesian inference with an implicit prior on flat minima. This implicit bias is why modern deep learning does not need explicit capacity control.
- Deep Q-Network (DQN)DQN parameterises \( Q(s, a; \theta) \) by a CNN and trains it by Q-learning on mini-batches sampled from a replay buffer, with a slowly-updated target network stabilising the bootstrapped target. Mnih et al. (2015) showed a single DQN architecture learning 49 Atari games from raw pixels, reaching human-level performance on many — the empirical breakthrough that ignited modern deep RL.
- Policy Gradient TheoremFor a stochastic policy \( \pi_\theta(a \mid s) \), the gradient of expected return \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] \) is \( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(A \mid S) \cdot Q^{\pi_\theta}(S, A)] \). No gradient flows through the environment dynamics — the theorem turns RL into a stochastic optimisation over policy parameters. It is the foundation of REINFORCE, actor-critic, PPO, TRPO, and GRPO.
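A REINFORCE sketch of the theorem on a two-armed bandit (toy softmax policy, no baseline; the reward means and learning rate are illustrative):

```python
import numpy as np

# REINFORCE: update theta along grad log pi(a) * R. Arm 1 pays more,
# so the sampled policy gradient drives probability mass toward it.
rng = np.random.default_rng(0)
theta = np.zeros(2)                     # softmax logits
means = np.array([0.0, 1.0])            # arm 1 is the better arm

for _ in range(500):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)
    r = means[a] + rng.normal(0, 0.1)   # noisy reward from the environment
    grad_logp = -probs                  # grad_theta log pi(a) for softmax:
    grad_logp[a] += 1.0                 # one-hot(a) - probs
    theta += 0.1 * r * grad_logp        # REINFORCE update

probs = np.exp(theta) / np.exp(theta).sum()
```

No gradient flows through the bandit itself — only through \( \log \pi_\theta \), exactly as the theorem states.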
- Trust Region Policy Optimization (TRPO)TRPO performs policy updates by solving a constrained optimisation: maximise a surrogate advantage subject to \( \mathbb{E}[D_{\text{KL}}(\pi_\text{old} \| \pi_\theta)] \le \delta \). The KL trust region gives a monotonic-improvement guarantee and prevents collapse under function approximation. TRPO is solved by natural-gradient + line search; PPO is its first-order clipped approximation with near-identical performance at much lower cost.
- Prefix-Tuning and Prompt-TuningPrompt-tuning and prefix-tuning freeze the base model and learn only a small set of continuous prompt parameters. Prompt-tuning adds learned embeddings at the input, while prefix-tuning adds learned key-value prefixes inside each attention layer.
- Mixture of Depths (MoD)Mixture of Depths (Raposo et al., 2024) lets each token choose whether to go through the expensive self-attention + MLP stack at each layer, or to skip it via a residual. A small router predicts a saliency score; the top-\( k \) tokens per batch compute, the rest pass through. This per-token adaptive compute is the depth-axis counterpart of Mixture-of-Experts (width-axis) and substantially reduces FLOPs at matched quality.
- NCCL Collectives: All-Reduce, All-Gather, Reduce-ScatterNCCL (NVIDIA Collective Communications Library) implements ring and tree algorithms for GPU-to-GPU communication primitives — the building blocks of distributed deep learning. All-reduce sums gradients across workers; reduce-scatter + all-gather are the two halves of all-reduce, exploited by ZeRO and FSDP to shard parameters. Understanding these primitives is essential to reasoning about training scalability.
- Triton GPU Kernel ProgrammingTriton (Tillet et al., 2019; now developed at OpenAI) is a Python-embedded DSL for writing high-performance GPU kernels. It exposes tile-level primitives (block pointers, block-scoped arithmetic, automatic vectorisation) while hiding shared-memory management and thread-level scheduling. FlashAttention-2, Mamba kernels, many PyTorch 2 inductor-generated kernels, and most modern custom ops are written in Triton — fast enough to match hand-tuned CUTLASS with far less code.
- LLaVA: Visual Instruction TuningLLaVA (Liu et al., 2023) connects a pretrained CLIP vision encoder to a pretrained LLM via a small learned projection — first training only the projector with both towers frozen, then instruction-tuning the projector and LLM on GPT-4-generated multimodal dialogues. The recipe is minimal — a single linear (later two-layer MLP) projector — yet competitive with closed VLMs, establishing the canonical open VLM pattern: (pretrained vision encoder) + (learned bridge) + (pretrained LLM) + (visual instruction data).
- Audio Language Models (AudioLM, MusicLM)Audio language models tokenise raw audio into discrete codes via neural audio codecs (SoundStream, Encodec), then model sequences of codes with a Transformer — the same next-token-prediction recipe as LLMs, applied to audio. AudioLM (Borsos et al., 2022) uses a two-level hierarchy of semantic and acoustic tokens for speech continuation; MusicLM extends to text-conditioned music generation.
- HNSW (Hierarchical Navigable Small World)HNSW is the graph-based approximate nearest-neighbour (ANN) algorithm powering most production vector databases. It maintains a multi-layer proximity graph where each layer is a small-world graph at a different density; search descends from sparse top layers to dense bottom layers by greedy edge-following. Query cost is \( O(\log n) \); recall-latency trade-off is tunable per query.
- ColBERT and Late-Interaction RetrievalColBERT (Khattab & Zaharia, 2020) represents each document as a bag of contextualised token vectors rather than a single vector, and scores against a query via MaxSim : for each query token, find its best-matching document token and sum the similarities. Late interaction preserves token-level granularity that pooling destroys, closing the quality gap between dense retrievers and cross-encoders at a fraction of the cost.
- Active Inference and the Free-Energy PrincipleFriston's free-energy principle frames perception, learning, and action as minimising a single quantity — variational free energy, a complexity-minus-accuracy upper bound on surprise. Active inference extends the principle to behaviour: agents select actions that minimise expected free energy, simultaneously seeking reward and information. It is a generative-model-based alternative to classical RL with tight links to variational inference, ELBO, and exploration–exploitation.
- Wasserstein Distance & Optimal TransportThe \( p \)-Wasserstein distance \( W_p(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \big( \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|^p \big)^{1/p} \) measures the minimum cost of reshaping distribution \( \mu \) into \( \nu \). It underpins WGAN, flow matching, and a whole family of divergences that remain well-behaved when KL blows up.
- f-Divergences (Unified View)For any convex \( f \) with \( f(1) = 0 \), the \( f \)-divergence \( D_f(P \| Q) = \mathbb{E}_Q[f(dP/dQ)] \) recovers KL (\( f = t \log t \)), reverse KL, Jensen–Shannon, total variation, \( \chi^2 \), Hellinger, and α-divergences as special cases. The variational (Fenchel) form underlies f-GAN and density-ratio estimation.
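A small sanity check that the general formula recovers KL when f(t) = t log t (toy discrete distributions):

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = E_Q[f(p/q)] for discrete distributions."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl(p, q):
    """KL(P || Q) computed directly for comparison."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
d_kl_via_f = f_divergence(p, q, lambda t: t * math.log(t))
```

Swapping in a different convex f — e.g. f(t) = 0.5·|t − 1| for total variation — reuses the same two-line function, which is the point of the unified view.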
- Itô Calculus & Stochastic Differential EquationsItô calculus extends ordinary calculus to processes driven by Brownian motion. An SDE \( dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t \) combines a drift and a diffusion term; Itô's lemma replaces the chain rule. This is the mathematical substrate of score-based diffusion models, flow matching, and neural SDEs.
- Fokker–Planck & Probability-Flow ODEThe Fokker–Planck equation \( \partial_t p_t = -\nabla \cdot (f p_t) + \tfrac{1}{2} \nabla^2 : (g g^\top p_t) \) governs how the density of an SDE-driven process evolves. The probability-flow ODE shares these exact marginals with a deterministic vector field, enabling DDIM-style deterministic sampling and likelihood computation.
- Reproducing Kernel Hilbert Spaces (RKHS) & the Representer TheoremAn RKHS is a Hilbert space of functions where evaluation at a point can be written as an inner product with a kernel function. The representer theorem says that many regularized empirical-risk problems in an RKHS have solutions that are finite sums of kernel evaluations at the training points, making kernel methods practical.
- Mercer's Theorem & Kernel Feature MapsMercer's theorem shows that a continuous positive-semidefinite kernel can be expanded in nonnegative eigenvalues and orthogonal eigenfunctions. This makes kernel functions equivalent to inner products in a possibly infinite-dimensional feature space and motivates the kernel trick.
- Natural Gradient & Fisher–Rao GeometryThe natural gradient \( \tilde\nabla_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L} \) preconditions the Euclidean gradient by the inverse Fisher information matrix, yielding steepest descent under the KL-divergence metric on the statistical manifold. It underlies K-FAC, Shampoo, TRPO's trust region, and the original motivation for reparameterisation-invariant optimisation.
- Hamiltonian Monte Carlo (HMC) & NUTSHMC augments the target distribution with an auxiliary momentum variable and simulates Hamiltonian dynamics so that proposals move long distances while staying on near-constant-energy shells. It explores complex posteriors dramatically faster than random-walk Metropolis; the No-U-Turn Sampler (NUTS) adapts the integration length automatically.
- Concentration Inequalities (Hoeffding, Bernstein, McDiarmid)High-probability bounds on how far a sum or function of independent random variables can deviate from its mean. Hoeffding uses boundedness, Bernstein exploits known variance for a tighter bound, and McDiarmid handles functions whose value changes little when any single argument changes — the workhorses behind PAC / generalization proofs.
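An empirical illustration of Hoeffding's inequality (the bound is loose relative to the true tail, but always valid for bounded variables):

```python
import numpy as np

# Hoeffding for n i.i.d. variables in [0, 1]:
# P(|mean - E| >= t) <= 2 exp(-2 n t^2).
rng = np.random.default_rng(0)
n, t, trials = 100, 0.15, 20000
means = rng.integers(0, 2, size=(trials, n)).mean(axis=1)  # Bernoulli(0.5) means
emp = (np.abs(means - 0.5) >= t).mean()                    # empirical tail mass
bound = 2 * np.exp(-2 * n * t ** 2)                        # Hoeffding bound
```

The empirical deviation probability sits well below the bound — the slack is what Bernstein recovers by exploiting the known variance.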
- PAC-Bayes Generalization BoundsPAC-Bayes bounds the generalization gap of a stochastic classifier \( Q \) by a Kullback–Leibler term against a data-independent prior \( P \): with probability \( 1 - \delta \), \( \mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat R(h)] + O(\sqrt{\text{KL}(Q\|P)/n}) \). Used for non-vacuous bounds on overparameterised networks.
- Belief Propagation (Sum–Product Algorithm)A message-passing algorithm that computes exact marginals on tree-structured graphical models in linear time and approximate marginals (loopy BP) on general graphs. Each node sends summarising messages to neighbours; final beliefs equal the product of incoming messages.
- β-VAE & Disentanglementβ-VAE replaces the ELBO's KL term with a weighted \( \beta \cdot D_{\text{KL}} \). Values \( \beta > 1 \) push the encoder toward an isotropic prior, encouraging each latent dimension to capture one independent factor of variation — the original disentanglement recipe.
- Normalizing Flows (RealNVP, Glow)Invertible neural networks \( f_\theta: \mathbb{R}^d \to \mathbb{R}^d \) with tractable Jacobian determinant. The change-of-variables formula \( \log p_X(x) = \log p_Z(f(x)) + \log |\det J_f(x)| \) gives exact likelihood; sampling runs \( f^{-1} \). RealNVP and Glow use coupling layers to make both directions \( O(d) \) per step.
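The change-of-variables formula checked on the simplest invertible map — standardising a Gaussian — as a one-dimensional stand-in for a coupling layer:

```python
import math

def log_normal(x, mu=0.0, sigma=1.0):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Invertible map f(x) = (x - mu) / sigma with |det J_f| = 1/sigma:
# log p_X(x) = log p_Z(f(x)) + log |det J_f| must equal the direct density.
mu, sigma, x = 1.5, 2.0, 0.7
z = (x - mu) / sigma
lp_flow = log_normal(z) + math.log(1.0 / sigma)   # flow-style evaluation
lp_direct = log_normal(x, mu, sigma)              # analytic N(x; mu, sigma^2)
```

RealNVP-style coupling layers are exactly this computation with a learned, input-dependent shift and scale on half the dimensions, so the Jacobian stays triangular and cheap.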
- Autoregressive Flows (MAF & IAF)Flows in which the \( i \)-th output depends only on previous inputs \( x_{<i} \), giving a triangular Jacobian. MAF (masked autoregressive flow) has fast density evaluation but slow sampling; IAF (inverse autoregressive flow) is the mirror image — fast sampling, slow density. Both are cornerstones of modern density estimation.
- Energy-Based Models (EBM)A generative model \( p_\theta(x) = \exp(-E_\theta(x))/Z(\theta) \) defined by a scalar energy \( E_\theta \). The intractable normaliser \( Z(\theta) = \int e^{-E_\theta(x)} dx \) precludes direct MLE; training uses contrastive divergence, score matching, or noise-contrastive estimation to approximate it.
- Score-Based SDEs (Continuous-Time Diffusion)Song et al. (2021) showed that discrete-time DDPM and noise-conditional score models are both limits of a continuous-time SDE \( dx = f(x,t)dt + g(t)dW \). The unified framework gives a reverse-time SDE and a probability-flow ODE that share marginals, enabling flexible samplers (Euler, Heun, DPM-Solver) and exact likelihoods.
- GAN Family: WGAN, StyleGAN, BigGANThree architectural and objective milestones: WGAN uses the Kantorovich–Rubinstein dual of \( W_1 \) as a smoother critic, StyleGAN introduces AdaIN-controlled style injection for image generation, BigGAN scales class-conditional GANs to 512×512 with orthogonal regularisation and truncation tricks.
- Neural Radiance Fields (NeRF) & 3D Gaussian SplattingNeRF encodes a 3-D scene as a continuous function \( (x, y, z, \theta, \phi) \to (\text{colour}, \text{density}) \) queried along camera rays and volume-rendered into pixels. 3D Gaussian Splatting replaces the implicit MLP with an explicit set of anisotropic Gaussians rasterised in real time.
- Neural Ordinary Differential EquationsA neural ODE defines the hidden-state evolution as \( dh/dt = f_\theta(h, t) \), integrated by a black-box ODE solver. Training uses the adjoint method to back-propagate at constant memory regardless of solver depth. Connects residual networks to continuous flows and underlies continuous normalising flows and flow matching.
- Memory-Augmented TransformersTransformer variants that extend effective context length with an external memory — Recurrent Memory Transformers (RMT) pass summary tokens across chunks, Memorizing Transformers retrieve past kNN keys, Infini-attention compresses the tail of context into a linear-attention state. A bridge between fixed-context Transformers and sequence models with unbounded memory.
- Set Transformer & Deep Sets (Permutation Invariance)Deep Sets: any permutation-invariant function on sets equals \( \rho(\sum_i \phi(x_i)) \) for learnable \( \phi, \rho \). Set Transformer replaces the sum with self-attention via Induced Set Attention Blocks, giving element-wise interactions while remaining permutation-equivariant.
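A toy numpy check of the Deep Sets template — random weights standing in for learned φ and ρ — showing that sum-pooling yields permutation invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8))     # toy weights for the element encoder phi
W_rho = rng.normal(size=(8,))       # toy weights for the set head rho

def deep_set(X):
    """rho(sum_i phi(x_i)): sum-pooling makes the output order-independent."""
    h = np.tanh(X @ W_phi)          # phi applied to each element, shape (n, 8)
    return float(np.tanh(h.sum(axis=0) @ W_rho))

X = rng.normal(size=(5, 3))
out = deep_set(X)
out_perm = deep_set(X[rng.permutation(5)])  # same set, shuffled order
```

Replacing the sum with self-attention over elements (as in the Set Transformer) keeps equivariance while letting elements interact before pooling.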
- Sharpness-Aware Minimization (SAM)Minimise a loss whose value is worst-case over a small \( \rho \)-ball of weight perturbations: \( \min_\theta \max_{\|\varepsilon\| \le \rho} \mathcal{L}(\theta + \varepsilon) \). The ascent step \( \varepsilon^\star \approx \rho \, \nabla \mathcal{L}/\|\nabla \mathcal{L}\| \) biases training toward flat minima, improving generalisation across ViT, ResNet, and LLM finetuning.
- Shampoo & K-FAC PreconditionersShampoo and K-FAC are second-order-inspired optimizers that precondition gradients with matrix or blockwise curvature information instead of only per-parameter learning rates. They aim to converge in fewer steps than Adam or SGD, especially in large-batch training where curvature estimates are more stable.
- Meta-Learning: MAML & ReptileMAML learns an initialisation \( \theta \) such that one or a few SGD steps on a new task yield good performance — formally \( \min_\theta \sum_\tau \mathcal{L}_\tau(\theta - \alpha \nabla \mathcal{L}_\tau(\theta)) \). Reptile is a first-order simplification that moves \( \theta \) toward per-task adapted parameters. Influential early 'learn to learn' recipes, since absorbed into prompt-based few-shot learning.
- Conformal PredictionA distribution-free procedure for turning any point predictor into a prediction set with guaranteed finite-sample coverage \( 1 - \alpha \) under exchangeability. Requires only a scoring function and a calibration set; no assumption on the underlying model or data distribution.
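A split-conformal sketch with absolute-residual scores (synthetic Gaussian residuals stand in for |y − ŷ| from any point predictor):

```python
import numpy as np

def conformal_quantile(scores, alpha=0.1):
    """Split conformal: the ceil((n+1)(1-alpha))-th smallest calibration score
    is the threshold with finite-sample coverage >= 1-alpha (exchangeability)."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1]

rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(size=500))   # stand-in for |y - yhat| residuals
q = conformal_quantile(cal_scores, alpha=0.1)
test_scores = np.abs(rng.normal(size=2000))
coverage = (test_scores <= q).mean()        # fraction of covered test points
```

The prediction set is then ŷ ± q; nothing about the model or the data distribution entered the calculation, only exchangeability.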
- Langevin Dynamics & MALALangevin MCMC treats gradient noise as deliberate: proposals drift along \( \nabla \log \pi(\theta) \) plus Gaussian perturbation so the chain targets \( \pi \). Metropolis-adjusted Langevin (MALA) corrects discretisation bias with an MH acceptance; unadjusted Langevin (ULA) trades bias for simplicity and scales to big data via stochastic gradients (SGLD).
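A minimal ULA sketch targeting a standard normal (illustrative step size; with no MH correction a small discretisation bias remains):

```python
import numpy as np

# ULA: theta drifts along grad log pi(theta) = -theta for N(0, 1),
# plus sqrt(step)-scaled Gaussian noise each iteration.
rng = np.random.default_rng(0)
step, theta, samples = 0.05, 0.0, []
for i in range(20000):
    grad_logpi = -theta
    theta = theta + 0.5 * step * grad_logpi + np.sqrt(step) * rng.normal()
    if i >= 1000:                       # discard burn-in
        samples.append(theta)
samples = np.array(samples)
```

The chain's empirical mean and variance match the N(0, 1) target up to the O(step) discretisation bias that MALA's accept/reject step would remove.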
- PixelCNN / PixelCNN++Autoregressive image models that factor \( p(x) = \prod_i p(x_i \mid x_{1:i-1}) \) with masked convolutions so each pixel sees only pixels above and to the left. Tractable likelihood and sharp samples; Gated PixelCNN adds gated activations and the horizontal/vertical stacks, while PixelCNN++ refines the output distribution with discretised logistic mixtures plus downsampling and shortcut connections.
- Self-Rewarding Language ModelsTrain a model to both generate responses and judge them, using the same weights. Each iteration: generate a pool of candidates, self-rate them, extract preference pairs, DPO-train on the pairs. Recursive improvement without external reward data; bottlenecks surface around judgement quality and diversity collapse.
- Long-Context Data Recipes (RULER, Needle Variants)Extending effective context beyond 128k requires (a) RoPE-scaling or position-interpolation to keep positional encodings sane, (b) a continued-pretraining dataset with real long documents and synthetic stitched tasks, and (c) evaluation beyond simple needle-in-a-haystack — RULER adds multi-needle, multi-hop, and aggregation subtasks that expose superficial-match shortcuts.
- ORPO (Odds-Ratio Preference Optimization)A reference-model-free preference-optimisation objective that combines SFT and preference learning in one loss: \( \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \lambda \log \sigma(\log \text{odds}(y_w) - \log \text{odds}(y_l)) \). Eliminates DPO's reference-model requirement, halving training memory.
- SimPO (Simple Preference Optimization)Reference-free preference objective that replaces DPO's reference ratio with a length-normalised policy log-likelihood: \( r(x, y) = \log \pi_\theta(y\mid x) / |y| \). Adds a margin \( \gamma \) to the preferred response. Simpler than DPO, matches or beats it, and length-normalisation reduces verbosity exploitation.
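A sketch of the SimPO objective on two hand-picked examples (the β and γ values are illustrative, not the paper's tuned settings):

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (avg logp winner - avg logp loser) - gamma):
    length-normalised, reference-free preference loss in the SimPO style."""
    r_w = beta * logp_w / len_w         # implicit reward of preferred response
    r_l = beta * logp_l / len_l         # implicit reward of rejected response
    margin = r_w - r_l - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Winner with higher per-token log-likelihood -> small loss; reversed -> large.
easy = simpo_loss(logp_w=-10.0, len_w=10, logp_l=-40.0, len_l=20)
hard = simpo_loss(logp_w=-30.0, len_w=10, logp_l=-20.0, len_l=20)
```

Dividing by |y| is what removes the incentive to win preferences by sheer length — a long response no longer accumulates reward token by token.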
- IPO (Identity Preference Optimization)Replaces DPO's sigmoid objective with a squared-error criterion on preference probabilities: \( \mathcal{L}_{\text{IPO}} = \mathbb{E}[(h_\theta(y_w, y_l) - 1/(2\beta))^2] \). Prevents DPO's tendency to over-separate preferred and rejected responses on easy pairs, reducing overfitting and improving generalisation.
- RLVR: RL from Verifiable RewardsTrain a reasoning policy with pure RL signals from tasks whose answers are automatically verifiable — math (exact match), code (unit-test execution), proof (checker), chess (engine eval). No preference model needed. The method behind DeepSeek-R1-Zero and the o1-style long-CoT reasoning families.
- Reward Hacking & Specification Gaming in RLHFWhen a learned reward model is a proxy for human preference, RL optimisation finds adversarial inputs that maximise reward without matching the true objective — verbose apologies, sycophancy, confident wrong answers, format exploitation. Goodhart's law in practice. Mitigations range from KL penalty and reward normalisation to process supervision and debate.
- Graph of ThoughtsGeneralises chain-of-thought and tree-of-thought to an arbitrary DAG of reasoning nodes, each produced by a prompt. Enables aggregation (merge parallel solutions), refinement loops, and reuse of intermediate results. Useful for problems where branch-and-combine dominates (sorting, constraint satisfaction, multi-step planning).
- Toolformer & Learned Tool UseToolformer trains a language model to decide when to call external tools and what arguments to send without requiring human demonstrations of tool use. It keeps tool calls only when the returned result improves the continuation, making tool use a self-supervised learning signal.
- Agentic Workflows & Multi-Agent OrchestrationSystems that compose multiple LLM calls — planner, executor, critic, tool-user — into an end-to-end workflow. Patterns range from ReAct loops to fixed DAGs (LangGraph) to role-playing ensembles (AutoGPT, BabyAGI). Success requires careful handoff design, termination criteria, and cost control.
- FlashAttention-2 and FlashAttention-3FlashAttention-2 and FlashAttention-3 are follow-on attention kernels that keep exact attention outputs while running much faster through better tiling, parallelism, and data movement. FA-2 improves work partitioning on modern GPUs, while FA-3 adds Hopper-specific asynchronous pipelines and low-precision support.
- Medusa & EAGLE Speculative-Decoding HeadsSpeculative decoding with learned draft heads instead of a separate draft model. Medusa adds \( K \) lightweight heads on the base model that predict future tokens; the base model verifies the tree of draft hypotheses in one forward pass. EAGLE models the residual stream directly and achieves 3–4× speedup with a tiny draft network.
- Lookahead DecodingExact speculative-like decoding without any draft model: maintain a running n-gram 'Jacobi window' that proposes multiple tokens ahead in parallel; verify in one pass. Lossless — outputs match greedy decoding exactly — and requires no extra training. Trades increased per-step compute for fewer sequential steps.
- Radix / Prefix-Cache Attention (SGLang)Share the KV cache across requests that start with a common prompt prefix. Store prefix trees keyed by token sequence; on a new request, find the longest matching prefix in the cache and reuse it. Cuts prefill latency and memory use for chat applications with shared system prompts or few-shot contexts.
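A toy prefix-matching sketch (a flat dict scan standing in for the radix tree; the token IDs and "KV pages" values are illustrative):

```python
def longest_cached_prefix(cache, tokens):
    """Longest cached token-prefix of `tokens` (linear scan over a dict keyed
    by token tuples; production systems keep a radix tree instead)."""
    best = ()
    for prefix in cache:
        if len(prefix) > len(best) and tuple(tokens[:len(prefix)]) == prefix:
            best = prefix
    return best

# Two chats share a system prompt: the second request reuses the cached
# prefix KV and only needs to prefill its own suffix.
sys_prompt = (1, 2, 3, 4)
cache = {sys_prompt: "kv-pages-sys", sys_prompt + (10, 11): "kv-pages-req1"}
req2 = [1, 2, 3, 4, 20, 21, 22]
hit = longest_cached_prefix(cache, req2)
to_prefill = req2[len(hit):]            # only the suffix needs prefill
```

The radix tree makes the lookup proportional to the matched prefix length rather than to the number of cached requests.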
- Quantized KV Cache (int4 / int8 / KIVI)Store the KV cache at lower precision — int8 or int4 — instead of fp16. Halves or quarters the memory footprint of long contexts at negligible quality cost. The main tricks are asymmetric (zero-point) scales and treating keys and values differently: KIVI quantises keys per-channel, because key outliers concentrate in a few channels, and values per-token, while keeping a small window of recent tokens in full precision.
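The core memory win is visible in a few lines. A minimal numpy sketch of asymmetric per-row int8 quantisation (a simplification; real caches add grouping and a full-precision recent window):

```python
import numpy as np

def quantize_int8_asym(x):
    """Asymmetric per-row int8 quantisation: x ~= scale * q + zero_point."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale, lo

def dequantize_int8_asym(q, scale, zero_point):
    return q.astype(np.float32) * scale + zero_point

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 128)).astype(np.float32)  # toy key cache
q, scale, zero = quantize_int8_asym(keys)
recon = dequantize_int8_asym(q, scale, zero)
max_err = float(np.abs(keys - recon).max())  # bounded by ~scale / 2
```

The uint8 buffer is 4× smaller than the fp32 original while reconstruction error stays below half a quantisation step.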
- Transcoders & Sparse CrosscodersTranscoders and sparse crosscoders are interpretability models that learn sparse dictionaries linking features across layers rather than explaining one layer in isolation. They are used to trace how a concept is transformed, preserved, or split as it moves through a network.
- Causal Scrubbing & Mediation AnalysisRigorous protocols for validating interpretability hypotheses. Causal scrubbing replaces the hypothesised-irrelevant computations with samples from a distribution that should preserve the output; mediation analysis tests whether a candidate component mediates the causal effect of an input on an output. Tools for turning 'this feature looks meaningful' into falsifiable claims.
- Circuit Discovery Pipelines (ACDC, Attribution Patching)Automated methods to locate the minimal sub-graph of attention heads and MLP components responsible for a given behaviour. ACDC greedily ablates edges in a causal graph, keeping only those whose removal degrades the behaviour; attribution patching approximates this with a single forward-backward pass per hypothesis.
- Representation Engineering & the Refusal DirectionShift model behaviour by directly manipulating residual-stream activations along interpretable directions. The 'refusal direction' (Arditi et al. 2024) is a single direction in activation space whose ablation jailbreaks open-weight chat models, and whose injection forces refusal — evidence that safety training installs a shallow, targeted feature.
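The ablation intervention itself is just a projection. A minimal numpy sketch, assuming a refusal direction has already been found (Arditi et al. obtain it from a difference of mean activations on harmful vs harmless prompts):

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove a direction from every activation: x <- x - (x . r_hat) r_hat.

    acts: (n_tokens, d_model) residual-stream activations.
    direction: (d_model,) candidate refusal direction."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

acts = np.array([[2.0, 1.0],
                 [0.0, 3.0]])
refusal_dir = np.array([1.0, 0.0])
ablated = ablate_direction(acts, refusal_dir)  # no component along r remains
```

Injection is the mirror image: add a scaled copy of the direction instead of subtracting the projection.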
- Concept Erasure & Null-Space ProjectionRemove a protected concept (gender, ethnicity, refusal, a specific memory) from representations by iteratively projecting activations onto the null space of linear classifiers for that concept. Achieves provable linear guarding of downstream use against the erased attribute, with bounded utility loss.
- Differential Privacy & DP-SGDFormal guarantee that an algorithm's output distribution barely changes if any single training example is replaced. DP-SGD achieves \( (\varepsilon, \delta) \)-DP by clipping per-example gradients and adding calibrated Gaussian noise. Central to privacy-preserving ML training on sensitive data.
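The two DP-SGD ingredients — per-example clipping and calibrated noise — fit in one function. A minimal numpy sketch (noise is switched off in the toy call only to make it deterministic; real training uses `noise_mult` around 1 and tracks the privacy budget with an accountant):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.0,
                lr=0.1, rng=None):
    """One DP-SGD update: clip each example's gradient to clip_norm,
    average, add Gaussian noise with std noise_mult * clip_norm / n."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    noisy_mean = clipped.mean(axis=0) + rng.normal(
        0.0, noise_mult * clip_norm / n, size=d)
    return -lr * noisy_mean  # parameter update

grads = np.array([[3.0, 4.0],   # norm 5 -> clipped to norm 1
                  [0.3, 0.4]])  # norm 0.5 -> left unchanged
update = dp_sgd_step(grads, clip_norm=1.0, noise_mult=0.0)
```

Clipping bounds each example's influence on the sum, which is exactly the sensitivity the Gaussian noise is calibrated against.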
- Model Stealing & Extraction AttacksModel stealing attacks recover a useful copy of a deployed model by querying it and training a substitute on the outputs. Extraction attacks go further and try to recover hidden parameters, decision rules, or embeddings directly, which matters for both proprietary models and privacy-sensitive systems.
- Data Poisoning & Backdoor AttacksInsert malicious training examples so the model learns a targeted behaviour — misclassification on a trigger pattern, backdoored refusal bypasses, or degraded accuracy on specific classes. BadNets demonstrated pixel-trigger backdoors; modern LLM poisoning targets instruction-tuning and preference data as well as web-scale pretraining corpora.
- LLM Watermarking (Kirchenbauer et al.)Embed a statistical signature into generated text that is invisible to humans but detectable by an algorithm with the watermarking secret. Kirchenbauer et al. (2023) partition the vocabulary into a pseudo-random green / red list per step, biasing generation toward green; later detection uses a \( z \)-test on green-token frequency.
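The detector side is a one-line statistic. A minimal sketch of the z-test, assuming `green_mask` has already been recomputed from the text using the watermark key (that recomputation, not shown, is where the secret lives):

```python
import numpy as np

def green_fraction_z(tokens, green_mask, gamma=0.5):
    """z-statistic for the Kirchenbauer et al. detector.

    Under H0 (unwatermarked text) each token lands in its step's green
    list with probability gamma, so the green count is Binomial(T, gamma);
    watermarked text pushes the count far above gamma * T."""
    T = len(tokens)
    g = sum(green_mask)
    return (g - gamma * T) / np.sqrt(T * gamma * (1 - gamma))

# toy check: 90 green tokens out of 100 under a gamma = 0.5 null
z = green_fraction_z(list(range(100)), [1] * 90 + [0] * 10)
```

A z around 4 already corresponds to a p-value of roughly 3e-5, which is why detection works even on fairly short passages.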
- Dangerous-Capability Evaluations (Bio, Cyber, Persuasion, Autonomy)Dangerous-capability evaluations are targeted tests for whether a model can meaningfully assist with high-consequence harms such as bio misuse, cyber offense, persuasive manipulation, or autonomous scheming. They are used as deployment-gating evidence because ordinary benchmark gains do not tell you whether a model has crossed a safety-relevant threshold.
- Prompt Injection: Taxonomy & DefencesAdversarial instructions embedded in model-accessible content — tool outputs, retrieved documents, emails — that override the user's original task. Direct (in user prompt) vs indirect (in external content). Defences include input filtering, dual-model separation, and structured prompt templates; none is a complete solution.
- CTC Loss & RNN-TransducerTwo objectives for training sequence-to-sequence models when the alignment between input frames and output tokens is unknown. CTC sums over all alignment paths with blank symbols but assumes output tokens are conditionally independent given the encoder; RNN-T removes that assumption by adding a prediction network over previous labels and a joint network that fuses it with the encoder, which also makes it naturally streamable. Backbones of modern ASR pipelines.
- Conformer ArchitectureA hybrid CNN-plus-attention block for speech recognition: each Conformer layer combines multi-head self-attention (global), depthwise convolutions (local), and sandwiched feed-forward modules. Outperforms pure Transformer and pure CNN on LibriSpeech and became the de facto encoder for production ASR.
- Text-to-Image: DALL-E Lineage & ImagenAutoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any)Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- Video Diffusion (Sora, Veo, Gen-3)Extend image-diffusion recipes to video with 3D patch embeddings, temporal attention, and long-context handling. Sora (OpenAI), Veo (Google), and Gen-3 (Runway) train DiT-style transformers over space-time patches of 1–60 second clips, conditioning on rich text captions for controllable generation.
- Mean Field Theory of Neural NetworksMean-field theory studies very wide neural networks by tracking distributions of parameters or activations instead of individual weights. It yields clean scaling limits for training dynamics and feature learning, and helps distinguish true feature-learning regimes from the lazy-training NTK regime.
- Information Bottleneck TheoryInformation Bottleneck theory studies representations that preserve information about the target while compressing information about the input, often through a trade-off like \( I(Z;Y) - \beta I(Z;X) \). It is a useful lens on representation learning and generalization, though its direct explanatory power for deep networks remains debated.
- Stability and GeneralizationAn algorithm is uniformly \( \beta \)-stable if replacing one training point changes its output's loss by at most \( \beta \). Bousquet & Elisseeff (2002) proved that \( \beta \)-stability bounds the generalization gap by \( O(\beta + 1/\sqrt{n}) \); Hardt, Recht & Singer (2016) showed SGD on smooth losses is \( O(T/n) \)-stable, giving the first algorithm-dependent generalization bound for deep learning that grows with training time.
- Algorithmic Alignment TheoryA neural architecture generalises better on a reasoning task when its computational structure aligns with the algorithm that solves the task. Xu et al. (2020) formalise sample complexity in terms of the number of network modules that must be learned and the per-module learnability, predicting that GNNs (multi-step message passing) align with dynamic-programming algorithms while plain MLPs do not.
- Spectral Bias of Neural NetworksSpectral bias is the tendency of gradient-trained neural networks to learn low-frequency or smooth components of a target function before high-frequency ones. This helps explain why neural nets often fit coarse structure early and fine detail later.
- Neural CollapseAt the terminal phase of training (TPT) — long after zero training error — the last-layer features and classifier weights of a deep classifier converge to a highly symmetric configuration: per-class feature means form a Simplex Equiangular Tight Frame (ETF), within-class variability collapses to zero, classifier weights align with the class means, and prediction reduces to nearest-class-centre. Papyan, Han & Donoho (2020) established this as a robust empirical phenomenon across architectures and datasets.
- Mode Connectivity in Loss LandscapesMode connectivity is the empirical finding that independently trained solutions can often be connected by a low-loss path in parameter space. This suggests that many minima in deep learning are not isolated basins but parts of wider connected regions.
- Gradient Noise ScaleA scalar diagnostic that estimates the largest useful batch size by comparing the variance of per-example gradients to the squared norm of the mean gradient: \( B_{\text{noise}} = \operatorname{tr}(\Sigma)/\|g\|^2 \). McCandlish et al. (2018) argue that returns to scaling batch size diminish sharply once \( B \gg B_{\text{noise}} \), giving a principled way to choose batch size during large-scale training.
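The statistic is cheap to estimate from a batch of per-example gradients. A naive numpy sketch (McCandlish et al. use an unbiased two-batch-size estimator in practice; this version just plugs in sample moments):

```python
import numpy as np

def gradient_noise_scale(per_example_grads):
    """Naive estimate of B_noise = tr(Sigma) / |g|^2.

    per_example_grads: (batch, n_params) matrix of flattened gradients.
    tr(Sigma) is estimated as the mean squared deviation of per-example
    gradients from the batch mean gradient g."""
    g = per_example_grads.mean(axis=0)
    centered = per_example_grads - g
    tr_sigma = (centered ** 2).sum(axis=1).mean()
    return tr_sigma / (g @ g)

# two orthogonal unit gradients: noise and signal are comparable
grads = np.array([[1.0, 0.0], [0.0, 1.0]])
b_noise = gradient_noise_scale(grads)
```

When the batch size is far above `b_noise`, averaging more examples mostly re-measures the same mean gradient, which is the diminishing-returns regime the diagnostic flags.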
- Adaptive Gradient Clipping (AGC)A per-parameter clipping rule introduced by Brock et al. (2021) that bounds each weight's update by a fraction of the weight's own norm: clip \( g \) so \( \|g\|/\|w\| \le \lambda \). Unlike global-norm clipping, AGC scales naturally with parameter magnitude and made it possible to train Normalizer-Free Networks (NFNets) without batch normalisation while matching its training stability.
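The rule is a single rescaling. A minimal numpy sketch, shown per-tensor for brevity (NFNets apply it per output unit, i.e. per row of each weight matrix):

```python
import numpy as np

def adaptive_grad_clip(g, w, lam=0.01, eps=1e-3):
    """AGC (Brock et al. 2021): rescale g so |g| / max(|w|, eps) <= lam.

    eps stops freshly initialised near-zero weights from having their
    gradients clipped to nothing."""
    w_norm = max(np.linalg.norm(w), eps)
    g_norm = np.linalg.norm(g)
    max_norm = lam * w_norm
    if g_norm > max_norm:
        g = g * (max_norm / g_norm)
    return g

w = np.array([3.0, 4.0])  # |w| = 5, so updates are capped at norm 0.05
g = np.array([0.6, 0.8])  # |g| = 1: ratio 0.2 exceeds lam = 0.01
g_clipped = adaptive_grad_clip(g, w, lam=0.01)
```

Because the bound scales with `|w|`, large layers tolerate large gradients while small layers stay protected, which global-norm clipping cannot express.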
- Self-Paced LearningA curriculum-learning variant where the model itself decides which examples are 'easy enough' to train on at the current step, by minimising a joint objective \( \sum_i v_i \ell_i - \lambda \sum_i v_i \) over both parameters \( \theta \) and per-example weights \( v_i \in \{0,1\} \) (or \( [0,1] \)). Kumar, Packer & Koller (2010) introduced it as a non-convex EM-style alternative to handcrafted curricula; \( \lambda \) is annealed from low (only easy examples) to high (all examples).
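The inner minimisation over \( v \) has a closed form that makes the curriculum explicit. A minimal numpy sketch:

```python
import numpy as np

def select_easy(losses, lam):
    """Closed-form inner step of self-paced learning.

    Minimising sum_i v_i * l_i - lam * sum_i v_i over v in {0,1}^n
    selects exactly the examples whose current loss is below lam:
    including example i changes the objective by l_i - lam."""
    return (losses < lam).astype(float)

losses = np.array([0.1, 0.5, 2.0, 5.0])
v_early = select_easy(losses, lam=1.0)   # early training: easy examples only
v_late = select_easy(losses, lam=10.0)   # lam annealed up: all examples
```

Alternating this selection step with parameter updates on the selected subset is the whole algorithm.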
- Loss Landscape VisualizationMethods for visualising high-dimensional loss surfaces by projecting parameters onto 1-D or 2-D subspaces. Goodfellow's linear interpolation (2014) plots loss along the line between two solutions; Li et al.'s filter normalisation (2018) plots loss in a 2-D plane spanned by random Gaussian directions normalised per-filter. The latter reveals that residual connections smooth the landscape and that flat minima correspond to wide bowls in the visualisation.
- Gradient Surgery (PCGrad) for Multi-Task LearningWhen two task gradients in multi-task learning point in conflicting directions (negative cosine), they partially cancel each other and slow learning. Yu et al.'s PCGrad (2020) projects each task gradient onto the normal plane of any conflicting task's gradient before summing, removing the destructive component. This 'gradient surgery' restores monotone progress on both tasks at modest cost.
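The projection step can be sketched directly. A minimal numpy version (deterministic task order for simplicity; the paper samples the order of conflicting tasks randomly):

```python
import numpy as np

def pcgrad(grads):
    """PCGrad (Yu et al. 2020): for each task gradient, project out the
    component along any other task's gradient it conflicts with
    (negative dot product), then sum the surgically altered gradients."""
    out = []
    for i, gi in enumerate(grads):
        g = gi.copy()
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = g @ gj
            if dot < 0:  # conflict: remove the destructive component
                g = g - (dot / (gj @ gj)) * gj
        out.append(g)
    return np.sum(out, axis=0)

g1 = np.array([1.0, 0.0])
g2 = np.array([-1.0, 1.0])   # conflicts with g1 (cosine < 0)
g_total = pcgrad([g1, g2])   # vs naive sum [0, 1], which kills task 1
```

The naive sum would zero out all progress along the first coordinate; after surgery both tasks retain a positive component.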
- Contrastive Learning TheoryThe theoretical account of why contrastive self-supervised objectives like InfoNCE produce useful representations. Wang & Isola (2020) show the InfoNCE loss decomposes into two asymptotic terms — \( \mathcal{L}_{\text{align}} \), pulling positive pairs together, and \( \mathcal{L}_{\text{unif}} \), spreading the marginal feature distribution uniformly on the hypersphere. The downstream linear-probe accuracy correlates almost perfectly with this alignment-uniformity trade-off, giving a geometric explanation for why contrastive learning works at all.
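The two terms of the decomposition are easy to compute on normalised features. A minimal numpy sketch of the Wang & Isola metrics (with their default exponents \( \alpha = 2 \), \( t = 2 \)):

```python
import numpy as np

def l_align(x, y, alpha=2):
    """Alignment: mean distance between positive pairs (rows of x, y
    are paired, already L2-normalised). Lower is better."""
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def l_unif(x, t=2):
    """Uniformity: log mean Gaussian-kernel similarity over distinct
    pairs. More negative means features spread more evenly on the sphere."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(x), k=1)
    return np.log(np.exp(-t * d2[iu]).mean())

views = np.array([[1.0, 0.0], [0.0, 1.0]])
align_val = l_align(views, views)                        # identical views
unif_val = l_unif(np.array([[1.0, 0.0], [-1.0, 0.0]]))   # antipodal pair
```

Collapsed features trivially minimise alignment but score worst on uniformity, which is why the trade-off, not either term alone, tracks downstream accuracy.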
- Non-Contrastive SSL (BYOL, SimSiam, Barlow Twins, VICReg)Non-contrastive self-supervised learning learns representations from multiple views of the same example without explicit negative pairs. Methods such as BYOL, SimSiam, Barlow Twins, and VICReg avoid collapse using asymmetry, stop-gradient, redundancy reduction, or variance-preserving terms instead of contrastive negatives.
- Disentangled Representation LearningDisentangled representation learning seeks latent coordinates that each correspond to separate underlying factors of variation in the data. It is attractive for control and interpretability, but in the unsupervised setting true disentanglement is usually not identifiable without extra inductive bias or supervision.
- Sparse Representations in Deep NetsA sparse representation is one where only a small fraction of units are active for any given input. Deep nets often develop sparsity through ReLU-like nonlinearities or explicit penalties, which can improve efficiency, feature selectivity, and sometimes interpretability.
- Representation CollapseRepresentation collapse is the failure mode where many inputs map to nearly the same embedding or hidden state, destroying useful information. It appears in several forms — constant-vector collapse in self-supervision, dimensional collapse where only a few directions survive, and cluster collapse in discrete latents — and each requires a different fix.
- Invariance vs EquivarianceA representation is invariant to a transformation if the output does not change when the input is transformed, and equivariant if the output changes in a predictable transformed way. CNN translation equivariance and classifier translation invariance are the canonical example pair.
- Metric Learning at ScaleMetric learning at scale trains embeddings so similar items are close and dissimilar items are far apart even when the dataset is too large for naive pairwise or triplet mining. The main challenge is finding informative negatives and keeping computation manageable as the corpus grows.
- Representation Alignment Across ModalitiesRepresentation alignment across modalities trains different encoders so paired inputs, such as an image and its caption, land near each other in a shared embedding space. This makes cross-modal retrieval and transfer possible by giving different modalities a common geometry.
- Autoregressive vs Diffusion TradeoffsAutoregressive models factorise \( p(x) = \prod_t p(x_t \mid x_{<t}) \) and dominate text generation; diffusion models learn a denoising process and dominate continuous-modality generation. The two paradigms differ in likelihood tractability, sampling cost, controllability, and compositionality — and the right choice depends on whether tokens are discrete, parallel decoding is required, and whether log-likelihood or perceptual quality is the figure of merit.
- Text-to-Image AlignmentText-to-image (T2I) alignment is the task of making generated images faithfully follow textual prompts — covering spatial layout, attribute binding, count, and style. Modern alignment relies on contrastive image–text encoders (CLIP, SigLIP, T5) injected via cross-attention into a diffusion or flow backbone, plus classifier-free guidance, RLHF-style preference fine-tuning, and reward models that grade prompt adherence.
- In-Context Learning MechanismsIn-context learning (ICL) is the empirical phenomenon that a frozen LLM solves new tasks from few-shot examples in the prompt. Mechanistic studies show ICL is implemented by a small set of attention circuits — induction heads, function vectors, and implicit gradient-descent-like updates — that emerge during pretraining once the data and depth budget cross a threshold.
- Alignment Techniques (RLHF, DPO, RLAIF, comparison)Modern LLM alignment uses preference data to adjust a pretrained model so it follows instructions, refuses unsafe content, and ranks desired behaviours above undesired ones. The dominant recipes — RLHF with PPO, DPO and its variants, and RLAIF with AI-generated preferences — share the same Bradley–Terry preference model but differ in optimiser, reward-model dependence, and stability.
- Neural Fields / Implicit Neural RepresentationsA neural field represents a continuous signal with a neural network that maps coordinates to values such as color, density, or signed distance. This makes the model itself a compact continuous representation of an image, shape, or scene, with NeRF as the best-known example.
- Vision Transformer (ViT) VariantsSince the original ViT, a wide family of variants has emerged that improve data efficiency, locality, hierarchy, and pretraining objective. The most influential are DeiT (training recipe), Swin (windowed hierarchical attention), MAE (masked-image pretraining), DINOv2 (self-distilled features), and SigLIP (sigmoid contrastive pretraining). Each addresses a specific weakness of the vanilla ViT.
- Graph TransformersGraph Transformers apply self-attention over graph-structured data, with positional encodings (Laplacian eigenvectors, random walks, shortest-path distances) injecting graph topology that vanilla attention lacks. They generalise message-passing GNNs and have become the leading architecture for molecular property prediction, code understanding, and combinatorial optimisation.
- Perceiver ArchitectureThe Perceiver uses cross-attention from a small latent array to a potentially very large input, then performs most computation in latent space. This decouples cost from input length and makes one architecture usable across images, audio, video, and other modalities.
- Neural Architecture Search (modern approaches)NAS automates the design of network architectures by searching a parameterised space against a validation objective. Modern NAS abandons the slow RL-controller approach (NASNet) in favour of weight-sharing one-shot supernets (DARTS, ProxylessNAS), zero-cost proxies, and architecture-aware scaling laws (EfficientNet, NFNet). NAS has produced strong vision backbones but plays a smaller role in the LLM era.
- Modular Neural NetworksA modular network composes specialised sub-networks (modules) under a routing or composition rule that decides which modules process each input. Mixture-of-Experts is the most successful instance, but the family includes routed adapter networks, modular meta-learners, and compositional architectures designed for systematic generalisation. The motivation is parameter efficiency and reusable skills.
- Neural Program InterpretersA neural program interpreter (NPI) is a network that executes program-like computations: looking up arguments, calling sub-routines, manipulating an external memory or stack, and conditioning on intermediate state. Early work (NPI, NTM, DNC, NeuralGPU) targeted symbolic algorithms; modern descendants are tool-using LLMs and chain-of-thought executors that lean on external interpreters and structured memory.
- Compressed Sparse Attention (CSA)Compressed Sparse Attention (CSA) is a long-context attention scheme that first compresses the KV cache into block summaries and then performs sparse attention only over the top-k relevant compressed blocks. An added sliding-window branch preserves exact local dependencies, so CSA cuts both KV memory and long-context attention compute without collapsing into a purely local window.
- Offline Reinforcement LearningOffline reinforcement learning learns a policy from a fixed logged dataset without further interaction with the environment. Its central difficulty is distribution shift: Bellman backups evaluate actions that are poorly supported by the data, so modern methods either constrain the learned policy to stay near the behavior data or pessimistically down-value unsupported actions.
- Model-Based Reinforcement LearningModel-based reinforcement learning learns an explicit or latent model of environment dynamics and uses that model for planning, imagination rollouts, or policy optimization. Its main advantage is sample efficiency, while its main failure mode is model bias: the policy can exploit errors in the learned simulator unless planning and training control compounding prediction error.
- Decision TransformersDecision Transformers cast offline reinforcement learning as conditional sequence modeling: a Transformer predicts the next action from past returns-to-go, states, and actions. This avoids explicit Bellman backups and instead treats policy learning like autoregressive imitation conditioned on the desired future return.
- Multi-Agent Reinforcement LearningMulti-agent reinforcement learning studies environments where several learning agents interact simultaneously, making each agent's dynamics depend on the evolving policies of the others. The main challenges are non-stationarity, coordination, and credit assignment, which is why centralized training with decentralized execution is a common modern design.
- Reward HackingReward hacking occurs when an agent maximizes the formal reward signal while failing at the designer's intended objective. It is a general Goodhart-style failure mode in reinforcement learning: stronger optimization pressure often finds loopholes in the proxy reward faster than humans can patch them.
- Safe Reinforcement LearningSafe reinforcement learning studies how to optimize long-term return while satisfying safety constraints during training and deployment. The standard formalism is a constrained Markov decision process, where the policy must maximize reward subject to a bound on expected cost, risk, or unsafe-state visitation.
- Hierarchical Reinforcement LearningHierarchical reinforcement learning decomposes control across time scales, usually by letting a high-level policy choose skills, options, or subgoals and a low-level policy execute them. This can make sparse-reward and long-horizon problems easier, but only if the learned hierarchy discovers reusable abstractions rather than collapsing back to flat control.
- Distributed Training (Data & Model Parallelism)Distributed training scales learning across many devices by splitting either the data, the model, or both. Data parallelism is simplest when the model fits on each device, while model parallel approaches such as tensor and pipeline parallelism are needed when parameters or activations are too large for one accelerator.
- Serving LLMs at ScaleServing LLMs at scale is a systems problem of jointly optimizing prompt prefill throughput, token-by-token decode latency, KV-cache memory, batching policy, and fleet utilization. Modern serving stacks rely on continuous batching, prefix caching, PagedAttention, speculative decoding, and sometimes prefill/decode disaggregation to keep both tail latency and GPU cost under control.
- Adversarial Robustness (Modern Attacks)Adversarial robustness studies how learned models can be forced into wrong predictions by carefully chosen perturbations that are small, structured, or semantically deceptive. Modern attack families include gradient-based \(\ell_p\) attacks, universal perturbations, adversarial patches, and transfer attacks; defenses must avoid merely hiding gradients while leaving the model fragile.
- Uncertainty Estimation in Deep LearningUncertainty estimation in deep learning tries to quantify when a model should be unsure, not just what label it predicts. The key distinction is between aleatoric uncertainty from irreducible noise in the data and epistemic uncertainty from limited knowledge of the model, and modern methods such as deep ensembles, MC dropout, and conformal prediction target those uncertainties differently.
- ML System Monitoring & Drift DetectionML system monitoring tracks whether a deployed model is still receiving the kind of data it was built for and whether its business and technical behavior remain acceptable. Drift detection is one part of that job: teams also monitor latency, calibration, feature freshness, label delay, feedback loops, and downstream outcomes, because data drift alone does not tell the whole production story.
- RLHF as KL-Regularized Policy OptimizationA deeper theoretical view of RLHF treats post-training as optimizing a policy against a learned reward while regularizing toward a reference model with a KL penalty. This viewpoint explains why PPO-RLHF, reward-model training, and even DPO-style objectives are closely related: they are different ways of solving or approximating the same regularized preference-optimization problem.
- Particle FilterA particle filter approximates the posterior over a hidden state with weighted samples, or particles, instead of a single Gaussian. It is useful for nonlinear or non-Gaussian state-space models, but resampling and weight degeneracy are central practical issues.
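One bootstrap-filter step — propagate, reweight, resample — fits in a short function. A minimal numpy sketch on a toy 1-D random-walk model:

```python
import numpy as np

def particle_filter_step(particles, weights, obs, transition, likelihood, rng):
    """Bootstrap particle filter: propagate particles through the
    dynamics, reweight by the observation likelihood, then multinomially
    resample to counteract weight degeneracy."""
    particles = transition(particles, rng)
    weights = weights * likelihood(obs, particles)
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(0)
n = 2000
particles = rng.normal(0.0, 5.0, size=n)     # broad prior over the state
weights = np.full(n, 1.0 / n)
for obs in [1.0, 1.2, 0.9]:                  # noisy observations near 1
    particles, weights = particle_filter_step(
        particles, weights, obs,
        transition=lambda p, r: p + r.normal(0.0, 0.1, p.shape),
        likelihood=lambda y, p: np.exp(-0.5 * (y - p) ** 2 / 0.5 ** 2),
        rng=rng)
estimate = particles.mean()                  # posterior mean estimate
```

Resampling every step is the simplest policy; production filters resample only when the effective sample size drops, precisely because resampling too often throws away particle diversity.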
- Canonical Correlation Analysis (CCA)Canonical correlation analysis finds linear combinations of two random vectors that are maximally correlated with each other. It is the right tool when the question is about shared structure between two views of the same examples rather than variance within a single view.
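A compact way to compute it: whiten each view, then the singular values of the whitened cross-covariance are the canonical correlations. A numpy sketch (no regularisation, so it assumes full-rank views with more samples than dimensions):

```python
import numpy as np

def cca_correlations(X, Y):
    """Canonical correlations of paired views X (n, dx) and Y (n, dy).

    Orthonormalise each centered view via SVD; the singular values of
    the product of the two orthonormal bases are the correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def orthobasis(A):
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return U  # orthonormal columns spanning A's column space
    return np.linalg.svd(orthobasis(X).T @ orthobasis(Y), compute_uv=False)

rng = np.random.default_rng(0)
n = 5000
shared = rng.standard_normal(n)  # one latent factor shared by both views
X = np.column_stack([shared + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([rng.standard_normal(n),
                     shared + 0.1 * rng.standard_normal(n)])
rho = cca_correlations(X, Y)  # top correlation near 1, second near 0
```

The first canonical pair recovers the shared factor; the second, built from the independent noise dimensions, correlates at roughly the \( 1/\sqrt{n} \) chance level.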
- Instrumental VariablesInstrumental variables identify causal effects when treatment is confounded, provided an instrument affects treatment, is as-good-as random with respect to unobserved confounders, and influences the outcome only through treatment. In simple linear settings, the IV estimand is the ratio of the instrument-outcome covariance to the instrument-treatment covariance.
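The covariance-ratio estimand is two lines of numpy. A simulated sketch that also shows the OLS bias IV is meant to remove:

```python
import numpy as np

def iv_estimate(z, x, y):
    """Just-identified linear IV (Wald) estimand:
    beta_IV = Cov(z, y) / Cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

rng = np.random.default_rng(0)
n = 200_000
u = rng.standard_normal(n)      # unobserved confounder of x and y
z = rng.standard_normal(n)      # instrument: shifts x, independent of u
x = 0.8 * z + u + 0.3 * rng.standard_normal(n)
y = 2.0 * x + 3.0 * u + 0.3 * rng.standard_normal(n)  # true effect: 2.0

beta_iv = iv_estimate(z, x, y)                        # near 2.0
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # confounded upward
```

OLS here converges to roughly \( 2 + 3\,\mathrm{Cov}(x,u)/\mathrm{Var}(x) \approx 3.7 \), while the instrument, being independent of \( u \), recovers the true coefficient.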
- Do-CalculusDo-calculus is Pearl's set of graphical transformation rules for turning interventional quantities into observational quantities when the causal graph permits it. It matters because it separates what can be identified from data plus structure from what remains fundamentally unidentifiable.