Tag: theory
432 topic(s)
- Test-Time Compute Scaling: Test-time compute scaling improves a model by spending extra computation at inference time, for example through search, verification, reranking, or adaptive refinement, instead of only scaling pretraining. It is most useful on prompts where the base model already has some chance of success, because additional compute can then amplify that success more efficiently than a much larger one-shot model.
- GPT-3 & Few-Shot In-Context Learning: GPT-3 showed that a 175B-parameter autoregressive Transformer can perform many tasks from natural-language instructions and a few demonstrations in the prompt, without gradient updates or task-specific fine-tuning. That result made in-context learning a central paradigm and showed that scale alone could unlock strong few-shot behavior.
- The Bitter Lesson: The Bitter Lesson is Sutton's argument that, over the long run, general methods that scale with compute and data outperform systems built around hand-crafted domain knowledge. It is a historical pattern claim, not a theorem, and its force comes from repeated examples in search, game playing, vision, and language.
- Continuous Thought Machines (CTM): Continuous Thought Machines are models that make neural timing and synchronization part of the representation, instead of treating layers as purely instantaneous mappings. They use neuron-level temporal processing and support adaptive compute, so the same model can stop early on easy inputs or continue reasoning on harder ones.
- Mechanistic OOCR Steering Vectors: Mechanistic OOCR steering vectors are a proposed explanation for some out-of-context reasoning results: fine-tuning can act like adding an approximately constant steering direction to the residual stream, rather than learning a deeply conditional new algorithm. That helps explain why a tuned behavior can generalize far beyond the fine-tuning data and why injecting or subtracting the vector can often reproduce or remove it.
- Chain-of-Thought Monitorability: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- GPT-1 (Generative Pre-Training): GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- ELMo (Embeddings from Language Models): ELMo produces contextualized word embeddings by taking a learned task-specific combination of hidden states from a pretrained bidirectional LSTM language model. Unlike static embeddings such as word2vec or GloVe, it gives the same word different vectors in different sentence contexts.
- Sparsely-Gated Mixture of Experts (MoE): A sparsely-gated Mixture of Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic.
- Key-Value Memory Networks: Key-Value Memory Networks store each memory slot as a key for retrieval and a separate value for the returned content. This decouples matching from payload and is a direct conceptual precursor to modern query-key-value attention.
- Luong Attention (Global and Local): Luong attention is a sequence-to-sequence attention mechanism that scores decoder states against encoder states using multiplicative forms such as dot or bilinear attention. It distinguishes global attention over all source positions from local attention over a predicted window, helping make neural machine translation more scalable.
- GloVe Word Embeddings: GloVe learns word embeddings by fitting vector dot products to the log of global word-word co-occurrence counts. Because it is trained on ratios of co-occurrence statistics, linear relations such as \( \text{king} - \text{man} + \text{woman} \approx \text{queen} \) often emerge in the embedding space.
- Neural Turing Machine (NTM): A Neural Turing Machine augments a neural controller with a differentiable external memory that it can read from and write to using soft attention over memory locations. It was an early attempt to learn algorithm-like behavior such as copying and sorting while remaining trainable end to end.
- Xavier/Glorot Initialization: Xavier or Glorot initialization chooses weight variance from fan-in and fan-out so activations and gradients stay roughly stable across deep layers. It is well suited to symmetric activations such as tanh, while ReLU networks usually prefer He initialization.
- Neural Probabilistic Language Model: The Neural Probabilistic Language Model replaced count-based n-grams with learned word embeddings and a neural network that predicts the next word from a continuous representation of context. Its core contribution was showing that distributed representations let language models generalize to unseen but similar word sequences.
- Weight Tying: Weight tying uses the same matrix for token embeddings and the output softmax projection, typically by setting the output weights to the transpose of the input embedding table. This cuts parameters and often improves language modeling by forcing input and output token representations to share geometry.
- Gradient Checkpointing (Activation Recomputation): Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- KL-Divergence Penalty in RLHF: The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the objective. This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
- Proximal Policy Optimization (PPO): Proximal Policy Optimization is a policy-gradient algorithm that improves a policy while clipping how far action probabilities can move from the previous policy in one update. In RLHF it is usually paired with a KL penalty so the model gains reward without drifting too far from a reference model.
- Next-Token Prediction Objective (Causal Language Modeling): Next-token prediction trains a causal language model to assign high probability to each token given all previous tokens. Maximizing this likelihood over large text corpora teaches the model syntax, facts, and reusable patterns that later support prompting and generation.
- SwiGLU Activation Function: SwiGLU is a gated feed-forward activation that multiplies one linear projection by a Swish-activated gate from another projection. It usually performs better than standard ReLU-style MLP blocks at similar scale, which is why many modern LLMs use it in their feed-forward layers.
- Pre-Norm vs. Post-Norm Architecture: Pre-Norm vs. Post-Norm is the choice of whether layer normalization is applied before or after each residual sublayer in a Transformer block. Pre-Norm usually trains deeper stacks more stably by preserving gradient flow through the residual path, while Post-Norm was the original design and can be less stable at scale.
- Rotary Positional Embedding (RoPE): Rotary Positional Embedding encodes position by rotating query and key vectors with token-index-dependent angles before attention is computed. Because the resulting dot products depend on relative offsets, RoPE gives Transformers a simple and widely used way to represent order.
- Sinusoidal Positional Encoding: Sinusoidal positional encoding adds fixed sine and cosine patterns of different frequencies to token embeddings so the model can infer token order. The encoding is deterministic and smooth across positions, which let the original Transformer represent position without learning a separate table.
- Causal (Masked) Self-Attention: Causal masked self-attention is self-attention with a mask that prevents each position from attending to future tokens. Applying the mask before softmax enforces autoregressive order, so the model can predict the next token without seeing the answer in advance.
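As a concrete illustration of the mask-before-softmax step in the entry above, here is a minimal pure-Python sketch (all names, such as `masked_softmax`, are illustrative, not from any library):

```python
import math

def causal_mask(n):
    # True where position i may attend to position j, i.e. j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    # Disallowed (future) positions are set to -inf before the softmax,
    # so they receive exactly zero attention weight.
    out = []
    for row_s, row_m in zip(scores, mask):
        masked = [s if keep else float("-inf") for s, keep in zip(row_s, row_m)]
        mx = max(masked)
        exps = [math.exp(s - mx) for s in masked]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[0.5, 2.0, 1.0],
          [1.0, 0.2, 3.0],
          [0.1, 0.4, 0.3]]
weights = masked_softmax(scores, causal_mask(3))
# The first position can only attend to itself, so weights[0] is [1.0, 0.0, 0.0].
```

Real implementations apply the same idea to batched attention-score tensors, but masking before (not after) the softmax is the load-bearing detail.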
- Message Passing in Graph Neural Networks (GNNs): Message passing in graph neural networks updates each node by aggregating transformed information from its neighbors and combining it with the node's current representation. After K rounds, a node's state depends on its K-hop neighborhood, which is why message passing is the core operation of most spatial GNNs.
- Triplet Margin Loss: Triplet margin loss trains an embedding space so an anchor is closer to a positive example than to a negative example by at least a fixed margin. It is a standard metric-learning objective because it directly enforces relative similarity rather than predicting class labels.
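The relative-similarity constraint above can be written directly as a hinge. A minimal sketch using Euclidean distance (the function name is illustrative, not from any library):

```python
def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Loss is zero once the positive is closer than the negative by at least the margin.
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Positive already much closer than negative: the hinge is inactive.
loss_easy = triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [5.0, 0.0])  # 0.0
# Negative closer than positive: the loss is positive and pushes them apart.
loss_hard = triplet_margin_loss([0.0, 0.0], [2.0, 0.0], [0.5, 0.0])  # 2.5
```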
- Lagrange Multipliers: Lagrange multipliers solve constrained optimization problems by introducing auxiliary variables that encode the constraints inside a single objective. At a constrained optimum, the gradient of the objective lies in the span of the constraint gradients, which is why the method is central to duality and SVM derivations.
- Ordinary Least Squares (OLS) Closed-Form Solution: The OLS closed-form solution is the exact least-squares answer computed directly from the design matrix rather than by iterative optimization. In the full-rank case it solves the normal equations, and geometrically it projects the target vector onto the column space of the features.
- PAC Learning (Probably Approximately Correct): PAC learning formalizes what it means for a hypothesis class to be learnable: with enough samples, an algorithm should return a hypothesis whose error is small with high probability. It is foundational because sample complexity and model capacity can then be expressed as rigorous guarantees instead of heuristics.
- Receiver Operating Characteristic (ROC) & AUC: The ROC curve plots true positive rate against false positive rate as a binary classifier's threshold changes, and AUC summarizes that curve into a single number. AUC also has a ranking interpretation: it is the probability that a random positive example scores above a random negative one.
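The ranking interpretation above suggests a direct, if quadratic-time, way to compute AUC by checking every positive/negative pair. A minimal sketch (names illustrative):

```python
def auc_by_ranking(scores, labels):
    # Fraction of positive/negative pairs the classifier ranks correctly,
    # counting ties as half-correct.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = auc_by_ranking(scores, labels)  # 5 of the 6 pairs are ranked correctly
```

Production code (for example scikit-learn's `roc_auc_score`) uses a sorting-based computation, but the pairwise definition is what that number means.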
- Hidden Markov Model (HMM): A Hidden Markov Model is a sequence model with an unobserved Markov chain of states and an observed emission distribution from each state. It became a standard model for speech, tagging, and other structured sequence tasks because dynamic programming can efficiently infer likely states and sequence probabilities.
- Convex Function: A convex function is one whose graph over any line segment lies on or below the chord connecting that segment's endpoints. This matters in optimization because convex problems have no spurious local minima: every local minimum is global.
- Shannon Entropy: Shannon entropy measures the expected surprisal of a random variable and quantifies how uncertain its outcomes are. It is the basic information-theoretic quantity from which cross-entropy, KL divergence, mutual information, and many ML loss functions are built.
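A minimal pure-Python sketch of the definition above, \( H(X) = -\sum_i p_i \log p_i \) (names illustrative):

```python
import math

def entropy(p, base=2):
    # Expected surprisal; zero-probability outcomes contribute nothing.
    return -sum(q * math.log(q, base) for q in p if q > 0)

fair_coin = entropy([0.5, 0.5])   # maximally uncertain: 1 bit
biased = entropy([0.9, 0.1])      # less uncertain: about 0.47 bits
certain = entropy([1.0, 0.0])     # no uncertainty: 0 bits
```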
- L1 vs. L2 Norms: The L1 norm sums absolute values and tends to promote sparsity when used as a penalty, while the L2 norm measures Euclidean length and tends to shrink weights smoothly without zeroing many of them out. That difference is why L1 is associated with feature selection and L2 with stable shrinkage.
- The Curse of Dimensionality: The curse of dimensionality is the collection of high-dimensional effects that make data sparse, neighborhoods less informative, and sample requirements explode as dimension grows. It helps explain why distance-based methods, density estimation, and exhaustive search often break down in large feature spaces.
- Markov Chain Monte Carlo (MCMC): Markov Chain Monte Carlo samples from a difficult target distribution by constructing a Markov chain whose stationary distribution matches that target. It is essential in Bayesian inference because it replaces intractable posterior integrals with averages over samples, provided the chain mixes well enough.
- Reparameterization Trick (VAE): The reparameterization trick writes a stochastic latent sample as a differentiable transformation of parameters and noise, typically \( z = \mu + \sigma \epsilon \). This lets gradients flow through sampling and makes variational autoencoder training practical with backpropagation.
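A minimal numeric sketch of the trick described above (no autodiff here; it only shows how the noise is factored out of the parameters, and the names are illustrative):

```python
import random

def reparameterized_sample(mu, sigma, rng):
    # All randomness lives in eps ~ N(0, 1); z is then a deterministic,
    # differentiable function of mu and sigma.
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)
samples = [reparameterized_sample(2.0, 0.5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)  # close to mu = 2.0
```

Because `mu` and `sigma` enter only through a deterministic expression, a framework with autodiff can backpropagate through the sample even though `eps` is random.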
- Autoencoders: Autoencoders are neural networks trained to reconstruct their inputs after passing them through a compressed or otherwise constrained latent representation. They are useful because the bottleneck forces the model to learn structure in the data rather than just memorize an identity map.
- GAN Minimax Objective: The GAN minimax objective sets up a two-player game in which a generator tries to produce samples that fool a discriminator, while the discriminator tries to distinguish real from generated data. At equilibrium the generator matches the data distribution, though the training game is often unstable in practice.
- Q-Learning: Q-learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function by bootstrapping from a Bellman target over the best next action. Because its update does not require following the current policy, it became a foundational method in both tabular RL and DQN-style deep RL.
- The Bellman Equation: The Bellman equation recursively expresses the value of a state or state-action pair as immediate reward plus discounted expected future value. It is the backbone of dynamic programming and reinforcement learning because it turns long-horizon return into a local consistency condition.
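As an illustration of the local consistency view above, here is a tiny value-iteration sketch on a deterministic three-state chain (a simplified special case of the general equation; names illustrative):

```python
def value_iteration(rewards, gamma=0.9, iters=50):
    # Repeatedly apply the Bellman backup V(s) = r(s) + gamma * V(s + 1)
    # on a deterministic left-to-right chain whose last state is terminal.
    n = len(rewards)
    V = [0.0] * n
    for _ in range(iters):
        V = [rewards[s] + gamma * V[s + 1] for s in range(n - 1)] + [rewards[-1]]
    return V

V = value_iteration([0.0, 0.0, 1.0], gamma=0.9)
# The terminal reward is discounted once per step of distance: V ≈ [0.81, 0.9, 1.0]
```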
- The Markov Property: The Markov property says that the conditional distribution of the future depends only on the present state, not on the full past history, once the state is known. It is the defining assumption behind Markov chains and MDPs and tells you when a state representation is sufficient for planning.
- Bootstrap Aggregating (Bagging): Bootstrap Aggregating trains multiple models on bootstrap-resampled versions of the training set and averages their predictions to reduce variance. It helps most with unstable base learners such as decision trees, which is why it underlies random forests.
- Gradient Boosting Machines (GBM): Gradient Boosting Machines build an additive model by fitting each new weak learner to the negative gradient of the current loss. In practice each stage focuses on correcting the remaining errors of the ensemble, which makes boosting powerful but sensitive to overfitting if trees and learning rates are not controlled.
- The Fisher Information Matrix: The Fisher Information Matrix measures how sensitive a model's log-likelihood is to changes in its parameters and therefore captures local statistical curvature. It underlies asymptotic variance bounds and natural gradient methods because it defines a geometry tied to the model's predictive distribution.
- AdaBoost (Adaptive Boosting): AdaBoost builds an ensemble by repeatedly fitting weak learners to reweighted data so that previously misclassified examples receive more attention. Its final predictor is a weighted vote of the learners, and its power comes from turning many slightly better-than-random classifiers into a strong one.
- Information Gain (Decision Trees): Information gain is the reduction in entropy achieved by splitting a dataset on a candidate feature. Decision-tree algorithms use it to choose splits that most reduce label uncertainty, though raw information gain can be biased toward features with many distinct values.
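A minimal sketch of the split criterion above for categorical features (names illustrative):

```python
import math

def label_entropy(labels):
    # Entropy of the empirical label distribution.
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, feature):
    # Entropy before the split minus the size-weighted entropy of each branch.
    n = len(labels)
    branches = {}
    for x, y in zip(feature, labels):
        branches.setdefault(x, []).append(y)
    after = sum(len(b) / n * label_entropy(b) for b in branches.values())
    return label_entropy(labels) - after

labels = [1, 1, 0, 0]
gain_perfect = information_gain(labels, [0, 0, 1, 1])  # separates the classes: gain 1.0
gain_useless = information_gain(labels, [0, 1, 0, 1])  # branches stay 50/50: gain 0.0
```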
- Singular Value Decomposition (SVD): Singular Value Decomposition factors any matrix into orthogonal directions and nonnegative singular values. It is fundamental because low-rank approximation, PCA, pseudoinverses, compression, and many denoising methods all follow from that decomposition.
- Maximum A Posteriori (MAP) Estimation: Maximum a posteriori estimation chooses the parameter value that maximizes posterior probability given the data. It is equivalent to maximum likelihood plus a log-prior regularizer, which is why MAP connects Bayesian estimation to familiar penalized optimization objectives.
- Supervised learning: Supervised learning trains a model on labeled input-output pairs so it can predict the correct target on new examples from the same distribution. Classification and regression are its two main forms, depending on whether the target is discrete or continuous.
- Unsupervised learning: Unsupervised learning tries to discover structure in data without labeled targets, such as clusters, latent factors, or a density model. It is used for representation learning, dimensionality reduction, clustering, and generative modeling when explicit supervision is unavailable.
- Reinforcement learning (RL): Reinforcement learning studies how an agent should act through trial and error to maximize cumulative reward in an environment. Unlike supervised learning, feedback is delayed and depends on the agent's own actions, so the problem is about sequential decision-making as much as prediction.
- Linear function: A linear function satisfies additivity and homogeneity, so it can be written as a matrix map with no bias term. In machine learning people often use 'linear' loosely for affine maps, but mathematically the distinction matters because adding a bias breaks true linearity.
- Affine transformation: An affine transformation is a linear map followed by a translation, so it has weights and a bias. Dense neural network layers are affine rather than strictly linear, because the bias lets the model shift activations and decision boundaries.
- Loss function: A loss function maps a model's prediction and the true target to a scalar error signal that training aims to minimize. It defines what the model is optimized for, so changing the loss changes which mistakes are treated as costly.
- Mean squared error (MSE): Mean squared error averages the squared difference between predicted and true values, making large errors count disproportionately more than small ones. For regression it is especially important because minimizing MSE is equivalent to maximum likelihood under Gaussian noise.
- Prediction error: Prediction error is the difference between a model's prediction and the true target for an example. It is the atomic quantity from which losses, residual analysis, and generalization metrics are built.
- Gradient descent: Gradient descent minimizes a differentiable objective by repeatedly moving parameters in the direction of steepest local decrease, namely the negative gradient. Its step size is set by the learning rate, so convergence depends on both objective geometry and update scale.
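The update rule above, \( x \leftarrow x - \eta \nabla f(x) \), in a minimal sketch (names illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # converges to ~3.0
```

For this particular objective the iterates diverge once `lr` exceeds 1.0, which illustrates the learning-rate sensitivity the entry mentions.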
- Stochastic gradient descent (SGD): Stochastic gradient descent updates parameters using a gradient estimate from one example or a very small random batch, making each step noisy but cheap. That noise can slow exact convergence yet often helps large models optimize and generalize in practice.
- Convergence: In optimization, convergence means an algorithm's iterates approach a stable solution or stationary point as updates continue. In practice people often mean that the loss or parameters stop changing much, though true convergence depends on the objective and algorithmic assumptions.
- Generalization: Generalization is a model's ability to perform well on unseen data from the same underlying distribution as its training data. It is the real goal of learning, because low training error alone can come from memorization rather than useful structure.
- Regularization: Regularization is any technique that biases learning toward simpler, more stable, or less overfit solutions. It can appear as an explicit penalty such as weight decay or as an implicit training choice such as data augmentation, dropout, or early stopping.
- L1 regularization (Lasso): L1 regularization adds a penalty proportional to the sum of absolute parameter values, encouraging many coefficients to become exactly zero. That sparsity makes Lasso useful when feature selection is part of the goal, not just shrinkage.
- L2 regularization (Ridge/Weight Decay): L2 regularization adds a penalty proportional to the sum of squared parameter values, shrinking weights toward zero without usually making them exactly sparse. In plain SGD it is equivalent to weight decay and is widely used because it improves stability and reduces variance.
- Backpropagation: Backpropagation computes gradients of a scalar loss with respect to all network parameters by applying the chain rule backward through the computation graph. It makes deep learning practical because it turns a complicated nested function into reusable local gradient calculations.
- Forward pass: The forward pass is the computation that maps input data through the model to produce activations and an output prediction. During training it also caches intermediate values needed later by the backward pass.
- Backward pass: The backward pass propagates gradients from the loss back through the computation graph to determine how each parameter affected the final error. It uses stored forward-pass intermediates and the chain rule to accumulate derivatives efficiently.
- Chain rule: The chain rule gives the derivative of a composition of functions by multiplying local derivatives along the computation path. It is the mathematical principle that backpropagation applies at scale throughout a neural network.
- Automatic differentiation: Automatic differentiation computes exact derivatives of a program by systematically composing derivatives of its primitive operations. Unlike symbolic differentiation it does not manipulate formulas, and unlike numerical differentiation it does not rely on finite-difference approximations.
- Computational graph: A computational graph represents a calculation as nodes for variables or operations and edges for data dependencies. It is useful because the same graph that defines the forward computation can also be traversed backward to perform automatic differentiation.
- Neural network: A neural network is a parameterized function built by composing affine transformations with nonlinear activations across layers. Its power comes from learning representations from data rather than relying on hand-crafted features for each task.
- Overfitting: Overfitting happens when a model fits patterns specific to the training set, including noise, better than it captures the underlying data-generating structure. The usual symptom is low training error paired with substantially worse validation or test error.
- Underfitting: Underfitting happens when a model is too limited, too constrained, or too poorly trained to capture the main structure in the data. It usually shows up as high error on both training and validation data, indicating high bias rather than variance.
- Neuron: A neuron in a neural network computes a weighted sum of its inputs, adds a bias, and applies an activation function. Collections of neurons form layers, so a single neuron's role is simple even though many together can represent complex functions.
- Activation function: An activation function is the nonlinear mapping applied after an affine transformation in a neural network. It is what prevents a stack of layers from collapsing into one affine map, enabling deep networks to approximate complex functions.
- Sigmoid: The sigmoid function maps a real number to a value between zero and one, making it easy to interpret as a probability or gate. Its downside is saturation at large positive or negative inputs, which can cause vanishing gradients in deep networks.
- Tanh: The tanh function maps inputs to the range \( (-1, 1) \) and is zero-centered, which often makes optimization easier than with the sigmoid. Like the sigmoid, however, it still saturates at large magnitudes and can cause vanishing gradients.
- Deep neural network: A deep neural network is a neural network with multiple hidden layers rather than just one or two. The extra depth lets it build hierarchical features and represent complex functions more efficiently than shallow networks in many settings.
- Composite function: A composite function applies one function to the output of another, such as \( f(g(x)) \). Neural networks are composite functions at scale, which is why gradients are computed by repeatedly applying the chain rule.
- Feedforward neural network: A feedforward neural network is a network whose computations move from input to output without recurrent cycles. Each layer depends only on earlier activations in the same pass, making feedforward networks the basic template for MLPs and many vision models.
- Recurrent neural network (RNN): A recurrent neural network processes sequences by maintaining a hidden state that is updated one step at a time from the current input and previous state. This gives it a notion of temporal memory, but plain RNNs are hard to train on long dependencies because gradients can vanish or explode.
- Hidden state: The hidden state is the internal representation a sequential model carries forward as it processes inputs over time. In an RNN or LSTM it summarizes relevant past context, and in broader neural architectures it usually means a layer's intermediate activation vector.
- Backpropagation through time (BPTT): Backpropagation through time trains a recurrent network by unrolling it across sequence steps and applying backpropagation to the resulting deep computational graph. It exposes how earlier states influence later losses, but long unrolls make optimization and memory use difficult.
- Vanishing gradient problem: The vanishing gradient problem is the tendency for gradients propagated through many layers or time steps to shrink exponentially, making early parameters learn extremely slowly. It is especially severe in deep sigmoid or tanh networks and was a main motivation for LSTMs, better initialization, and residual connections.
- Long short-term memory (LSTM): Long short-term memory is a gated recurrent architecture designed to preserve information over long timescales. Its input, forget, and output gates regulate a cell state with near-linear self-connections, which helps prevent the vanishing-gradient behavior of simple RNNs.
- Gated recurrent unit (GRU): A gated recurrent unit is a recurrent architecture that uses update and reset gates to control how much past information is kept and how much new input is written into the hidden state. It is simpler than an LSTM because it has no separate cell state, yet it often achieves similar sequence-modeling performance.
- Embedding layer: An embedding layer maps discrete IDs such as words, subwords, or items to learned dense vectors. It is essential whenever symbolic inputs must be represented in a continuous space that gradient-based models can manipulate.
- Embedding vector: An embedding vector is the dense continuous representation assigned to a discrete token, item, or entity by an embedding table or model. Its meaning comes from geometry: similar entities tend to occupy nearby directions or neighborhoods in the learned space.
- Word embedding: A word embedding is a dense vector representation of a word learned from distributional context rather than hand-coded features. Its purpose is to place semantically or syntactically related words near one another in vector space so downstream models can generalize across vocabulary items.
- Word2Vec: Word2Vec is a family of shallow neural methods that learn word embeddings from local context, most famously via the skip-gram and CBOW objectives. Its importance is that simple predictive training on large text corpora produced useful semantic geometry, including analogy-like linear regularities.
- Skip-gram: Skip-gram trains a model to predict surrounding context words from a center word. It learns embeddings that are especially good for capturing rare-word semantics because each observed word directly becomes a prediction source for many context targets.
- FastText: FastText extends Word2Vec by representing a word as a bag of character n-gram embeddings rather than as a single atomic vector. That lets it model morphology and produce reasonable embeddings for rare or even unseen words.
- Semantic similarity: Semantic similarity is the degree to which two words, sentences, or documents share meaning rather than just surface form. In machine learning it is often estimated with embeddings and cosine similarity, which turns meaning comparison into a geometric problem.
- Cosine similarity: Cosine similarity measures the angle between two vectors: \( \cos \theta = x \cdot y / (\|x\| \|y\|) \). It ignores magnitude and compares direction, which is why it is the default similarity metric for embeddings in retrieval, clustering, and semantic search.
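The formula above in a minimal pure-Python sketch (names illustrative; real systems use vectorized libraries):

```python
import math

def cosine_similarity(x, y):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Magnitude is ignored: a vector and any positive rescaling of it score 1.0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])  # 0.0
```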
- Bag of words: Bag of words represents a document by counts or weights of vocabulary terms while discarding word order and syntax. It is simple, sparse, and historically central to information retrieval and document classification, but it cannot distinguish sentences with the same words in different orders.
- Document-term matrix: A document-term matrix is a matrix whose rows are documents, columns are vocabulary terms, and entries are counts or weights such as TF-IDF. It is the core data structure behind bag-of-words retrieval, topic modeling, and many classical NLP pipelines.
- TF-IDF: TF-IDF weights a term by how frequent it is in a document and how rare it is across the corpus, typically \( tf(w,d) \log(N/df(w)) \). It downweights ubiquitous words and highlights terms that are especially informative for a given document.
- Sparsity: Sparsity means most entries in a vector, matrix, or parameter set are exactly zero. In ML it matters because sparse representations save memory and computation, and because sparsity-inducing penalties such as L1 can make models more interpretable.
- Dense vector: A dense vector is a low- or moderate-dimensional representation in which most entries are nonzero. Dense vectors are usually learned embeddings, so they capture semantic similarity better than sparse count vectors but are harder to interpret directly.
- Sparse vector: A sparse vector has very few nonzero entries relative to its dimensionality. Classical text features such as bag-of-words and TF-IDF are sparse, which makes them memory-efficient and interpretable even when the feature space is huge.
- One-hot encoding: One-hot encoding represents a categorical variable as a binary vector with exactly one 1 and all other entries 0. It preserves category identity without implying any ordering, but its dimensionality grows linearly with the number of categories.
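A minimal sketch of the encoding above (the sorted-vocabulary indexing is an arbitrary choice for the example; names illustrative):

```python
def one_hot(categories):
    # Assign each distinct category an index, then emit basis vectors.
    index = {c: i for i, c in enumerate(sorted(set(categories)))}
    dim = len(index)
    return [[1 if index[c] == j else 0 for j in range(dim)] for c in categories]

vectors = one_hot(["red", "green", "red", "blue"])
# The vocabulary sorts to blue=0, green=1, red=2, so "red" becomes [0, 0, 1].
```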
- Token: A token is the discrete unit a language model reads and predicts. Depending on the tokenizer, a token may be a word, subword, byte, punctuation mark, or special control symbol, and token count determines both context usage and API cost.
- Corpus: A corpus is a structured collection of text used to train, fine-tune, or evaluate language models. Its size, quality, domain mix, and cleaning decisions strongly shape what a model knows, how it generalizes, and which biases it inherits.
- N-gram: An n-gram is a contiguous sequence of \( n \) tokens, such as a bigram for \( n=2 \) or trigram for \( n=3 \). N-grams are the basic units of classical language models and many text features because they capture short-range local context.
- Count-based language modelA count-based language model estimates sequence probabilities from n-gram counts in a corpus, then uses smoothing or backoff for unseen events. It was the dominant pre-neural approach to language modeling, but it struggles with long context and data sparsity.
- Language modelA language model assigns probabilities to token sequences, or equivalently predicts missing or next tokens from context. This unifies classical n-gram models, masked models like BERT, and autoregressive LLMs such as GPT under one probabilistic framework.
- PerplexityPerplexity is the exponentiated average negative log-likelihood of a test sequence, so lower perplexity means the model is less surprised by the data. It is a standard intrinsic metric for language models, though low perplexity does not guarantee downstream usefulness.
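The definition above translates directly into a few lines, assuming we already have the model's probability for each observed token (a hypothetical `perplexity` helper):

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token
# has perplexity 4: it is "choosing uniformly among 4 options".
print(perplexity([0.25, 0.25, 0.25]))  # ≈ 4.0
```

The effective-vocabulary reading (perplexity \( k \) behaves like a uniform choice over \( k \) options) is a common intuition for the number.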
- Log-likelihoodLog-likelihood is the logarithm of the probability a model assigns to the observed data under given parameter values. Taking logs turns products into sums, making estimation numerically stable and turning maximum likelihood into a tractable optimization problem.
- Negative log-likelihoodNegative log-likelihood is the loss obtained by negating the log-likelihood, so maximizing probability becomes minimizing a positive objective. It is the standard training loss for probabilistic classifiers, language models, and many generative models.
- Maximum likelihood estimate (MLE)The maximum likelihood estimate selects the parameter values that make the observed data most probable under the model. Many standard ML objectives, including cross-entropy for classification and next-token prediction for LLMs, are just MLE written as minimization of negative log-likelihood.
- Laplace smoothingLaplace smoothing adds a small constant, often 1, to every discrete count before normalizing probabilities. It prevents zero-probability events in models such as naive Bayes and n-gram LMs, though it can over-smooth when the vocabulary is large.
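A minimal add-alpha sketch of the idea (the counts and `laplace_prob` helper are illustrative):

```python
def laplace_prob(word, counts, vocab_size, alpha=1):
    """Add-alpha smoothed probability of `word` given observed counts."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

counts = {"cat": 3, "dog": 1}   # 4 observed tokens, vocabulary of 4 types
print(laplace_prob("cat", counts, vocab_size=4))   # (3+1)/(4+4) = 0.5
print(laplace_prob("fish", counts, vocab_size=4))  # (0+1)/(4+4) = 0.125, never zero
```

The unseen word "fish" gets a nonzero probability, which is the whole point; with a large vocabulary, though, that reserved mass can crowd out the observed counts, which is the over-smoothing caveat above.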
- Conditional probabilityConditional probability is the probability of an event after restricting attention to cases where another event is known to occur, written \( P(A \mid B) = P(A,B)/P(B) \). It is the basic object behind Bayes' rule, autoregressive models, and all context-dependent prediction.
- Discrete probability distributionA discrete probability distribution assigns nonnegative probabilities to a countable set of outcomes that sum to 1. Softmax outputs in classification and next-token prediction are discrete distributions over labels or vocabulary items.
- Backoff (N-gram backoff)Backoff is an n-gram smoothing strategy that uses a high-order estimate when it has enough evidence and otherwise falls back to a lower-order n-gram. It handles sparsity by preferring specific context when available without assigning zero probability to unseen sequences.
- Zipf's lawZipf's law says a word's frequency is roughly inversely proportional to its rank in the frequency table. This heavy-tailed structure explains why a few tokens dominate corpora, why vocabularies keep growing with more data, and why tokenization and smoothing are central in NLP.
- Cross-entropyCross-entropy measures the average coding cost of samples from a true distribution \( p \) when encoded using a model distribution \( q \). In ML it is the standard loss for classification and language modeling, and minimizing it is equivalent to maximum likelihood up to an entropy constant.
- Binary cross-entropyBinary cross-entropy is the cross-entropy loss for a Bernoulli target, typically \( -[y\log \hat p + (1-y)\log(1-\hat p)] \). It is the standard loss for binary classification and for multi-label problems where each label is predicted independently.
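The formula above in code, with a small clip to avoid \( \log 0 \) (a standard numerical guard, shown here as an assumption rather than part of the definition):

```python
import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """-[y log p + (1-y) log(1-p)], with p clipped away from 0 and 1."""
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))  # ≈ 0.105: confident and correct, low loss
print(binary_cross_entropy(1, 0.1))  # ≈ 2.303: confident and wrong, high loss
```

The asymmetry of the two outputs shows why the loss punishes confident mistakes far more than cautious ones.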
- LogitA logit is the raw score before sigmoid or softmax normalization. In binary settings, the logit is also the log-odds \( \log\frac{p}{1-p} \), which is why linear models such as logistic regression operate naturally in logit space.
- ClassificationClassification is a supervised learning task in which the target is a discrete label rather than a continuous value. The model learns decision boundaries that separate classes, often outputting calibrated class probabilities as well as the predicted label.
- Binary classificationBinary classification is classification with exactly two classes, usually framed as predicting the probability of a positive class. It is commonly trained with logistic regression or a sigmoid output and binary cross-entropy loss.
- Multiclass classificationMulticlass classification assigns each input to exactly one of \( K>2 \) mutually exclusive classes. Models usually produce a softmax distribution over classes and train with cross-entropy against a one-hot or label-smoothed target.
- RegressionRegression is a supervised learning task where the target is continuous rather than categorical. The model predicts a numeric value, and common losses such as mean squared error correspond to assumptions about the noise model, especially Gaussian noise.
- Linear RegressionLinear regression models a target as an affine function of the inputs, typically \( y \approx w^\top x + b \), and fits the parameters by minimizing squared residuals. It is the canonical baseline for regression because it is interpretable and often has a closed-form OLS solution.
- Logistic RegressionLogistic regression is a linear classifier that models the log-odds of a class as \( w^\top x + b \) and maps that score through a sigmoid to get a probability. Despite its name, it is a classification model, not a regression model.
- AccuracyAccuracy is the fraction of predictions that are correct, \( (\text{TP}+\text{TN})/N \). It is intuitive and useful when classes are balanced, but it can be badly misleading on imbalanced datasets where always predicting the majority class already yields high accuracy.
- PrecisionPrecision is the fraction of predicted positives that are truly positive, \( \text{TP}/(\text{TP}+\text{FP}) \). It matters most when false positives are costly, such as spam filters, safety classifiers, or medical screening follow-ups.
- RecallRecall is the fraction of actual positives that the model successfully retrieves, \( \text{TP}/(\text{TP}+\text{FN}) \). It matters most when missing positives is costly, such as fraud detection, disease screening, or retrieval systems where relevant items should not be overlooked.
- F1 ScoreF1 score is the harmonic mean of precision and recall, \( 2PR/(P+R) \). It is high only when both precision and recall are high, making it useful for imbalanced classification where accuracy hides the trade-off between false positives and false negatives.
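The precision, recall, and F1 formulas above fit in one small sketch (the confusion-matrix counts are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean from confusion counts."""
    precision = tp / (tp + fp)            # TP / (TP + FP)
    recall = tp / (tp + fn)               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 true positives, 2 false positives, 4 false negatives:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two values, so a model cannot score well by maximizing one at the expense of the other.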
- ROUGEROUGE is a family of overlap metrics for summarization and generation, based on matching n-grams, longest common subsequences, or skip-bigrams between a candidate and reference text. It measures lexical recall more than semantic faithfulness, so it is informative but limited.
- Longest Common SubsequenceThe longest common subsequence is the longest sequence of symbols that appears in two sequences in the same order, not necessarily contiguously. It underlies edit-distance-style dynamic programming and metrics such as ROUGE-L because it captures shared sequence structure beyond exact n-gram matches.
- Edit DistanceEdit distance is the minimum number of insertions, deletions, and substitutions needed to transform one sequence into another. The most common version, Levenshtein distance, is a dynamic-programming measure of string similarity used in spelling correction, alignment, and evaluation.
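The Levenshtein version can be computed with the classic two-row dynamic program (this `levenshtein` helper is a sketch of that standard recurrence):

```python
def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions to turn `a` into `b`."""
    prev = list(range(len(b) + 1))        # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The "kitten" to "sitting" distance of 3 (two substitutions plus one insertion) is the textbook example.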
- PerceptronThe perceptron is a linear threshold classifier that predicts a class from the sign of \( w^\top x + b \) and updates its weights only on mistakes. It is historically important because it introduced gradient-like learning for linear separators, but it only converges when the data are linearly separable.
- Decision TreeA decision tree predicts by recursively splitting the feature space with if-then tests until a leaf assigns a class or value. Trees are easy to interpret and capture nonlinearity, but a single deep tree has high variance and overfits without pruning or ensembling.
- Random ForestA random forest is an ensemble of decision trees trained on bootstrap samples with random feature subsetting at each split. Averaging many decorrelated trees greatly reduces variance, which is why random forests are strong tabular baselines with little tuning.
- Support Vector Machine (SVM)A support vector machine finds the decision boundary that maximizes the margin between classes, depending only on the support vectors nearest the boundary. With kernels, SVMs can model nonlinear separators while retaining a convex optimization objective.
- Kernel MethodsKernel methods turn linear algorithms into nonlinear ones by replacing inner products with a kernel function that implicitly measures similarity in a higher-dimensional feature space. This is the core trick behind SVMs, kernel ridge regression, and Gaussian processes.
- Principal Component Analysis (PCA)Principal component analysis finds orthogonal directions of maximal variance in the data and projects onto the top few of them. It is a linear dimensionality-reduction method that compresses data, denoises features, and reveals dominant global structure through eigenvectors of the covariance matrix.
- Dimensionality ReductionDimensionality reduction maps data into fewer dimensions while preserving as much important structure as possible, such as variance, distances, or neighborhood relations. It is used for compression, visualization, denoising, and making downstream learning easier in high-dimensional spaces.
- TransformerA Transformer is a sequence model built from self-attention, position-wise MLPs, residual connections, and normalization, rather than recurrence or convolution. Its key advantage is that every token can directly attend to every other token in parallel, which made modern LLM scaling practical.
- Decoder BlockA decoder block is the basic unit of a decoder-only Transformer: causal self-attention plus a position-wise MLP, wrapped with residual connections and normalization. Stacking these blocks lets the model mix context across tokens while preserving autoregressive generation.
- Decoder-only TransformerA decoder-only Transformer is a Transformer built from a stack of decoder blocks whose self-attention is causally masked, so each token can attend only to earlier tokens. This makes it the standard architecture for autoregressive language models such as GPT, LLaMA, and Claude.
- Self-AttentionSelf-attention lets each token compute a weighted combination of representations from other tokens in the same sequence, with weights determined by query-key similarity. It is the mechanism that gives Transformers flexible, content-dependent context mixing without recurrence.
- Attention ScoreAn attention score is the compatibility value computed between a query and a key before normalization, often by dot product or a learned variant. Higher scores mean the corresponding token or memory slot should receive more weight after the softmax.
- Scaled Attention ScoreA scaled attention score is a query-key dot product divided by \( \sqrt{d_k} \) before softmax. The scaling keeps the variance of the logits from growing with key dimension, which helps prevent softmax saturation and keeps gradients well behaved.
- Masked Attention ScoreA masked attention score is an attention logit after adding a mask that blocks forbidden positions, typically by adding a very large negative value before softmax. This forces the resulting attention weight to be effectively zero at those positions.
- Attention WeightsAttention weights are the normalized coefficients, usually produced by a softmax over attention scores, that determine how much each value vector contributes to the output. They form a distribution over positions or memory entries for each query.
- Causal MaskA causal mask blocks attention to future positions by masking entries above the sequence diagonal. It enforces left-to-right autoregressive prediction, ensuring that token \( t \) can depend only on tokens \( \le t \).
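The masking-and-normalization pipeline described in the last few entries can be sketched in a few lines. The score matrix here is hypothetical and assumed already scaled by \( 1/\sqrt{d_k} \); masking is done by setting blocked logits to negative infinity before the softmax:

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Block future positions: entries above the diagonal become -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                            # subtract max for numerical stability
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) = 0, so masked slots get weight 0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scaled query-key scores for a 3-token sequence:
scores = [[0.5, 2.0, 1.0],
          [1.0, 0.5, 2.0],
          [0.2, 0.2, 0.2]]
w = causal_attention_weights(scores)
print([round(x, 3) for x in w[0]])  # [1.0, 0.0, 0.0]: token 0 can only see itself
print([round(x, 3) for x in w[2]])  # equal visible scores give a uniform distribution
```

Each row sums to 1, and masked positions receive exactly zero weight, which is what "effectively zero after softmax" means in practice.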
- Multi-Head AttentionMulti-head attention runs several attention mechanisms in parallel on different learned projections of the same input, then concatenates their outputs. This lets the model capture multiple relational patterns at once instead of forcing all interactions through a single attention map.
- Attention HeadAn attention head is one parallel query-key-value attention computation inside multi-head attention. Different heads can specialize to different patterns, such as local syntax, long-range dependencies, or induction-like copying behavior.
- Query, Key, Value (QKV)Query, key, and value are the three learned projections used by attention: the query asks what to look for, the key says what each position offers, and the value is the content returned if that position is attended to. Attention weights come from query-key similarity, but outputs are weighted sums of values.
- Projection MatrixA projection matrix is a learned linear map that transforms vectors into another representation space. In Transformers, separate projection matrices create Q, K, and V from hidden states, and another projection maps concatenated head outputs back to the model dimension.
- Position-wise MLPA position-wise MLP is the feed-forward sublayer in a Transformer block, applied independently to each token after attention. It adds nonlinearity and channel mixing per token, complementing attention, which mixes information across positions.
- Residual Connection (Skip Connection)A residual connection adds a layer's input back to its output, so the layer learns a correction rather than an entirely new representation. This stabilizes optimization, improves gradient flow, and is one reason very deep networks and Transformers train reliably.
- Context WindowThe context window is the maximum number of tokens a model can process in one forward pass. It defines the model's accessible working memory at inference time, and longer windows increase both usefulness on long documents and computational cost.
- AutoregressionAutoregression is the factorization of a sequence distribution into a product of conditional next-step distributions. In language generation it means producing one token at a time, each conditioned on all previously generated tokens.
- PromptA prompt is the text or structured input given to a language model to condition its behavior and output. It can provide instructions, examples, retrieved context, or tool schemas, and in practice it acts as the model's temporary task specification.
- Few-Shot PromptingFew-shot prompting includes a small number of labeled examples in the prompt so the model can infer the task from context without updating parameters. It is one of the clearest demonstrations of in-context learning in large language models.
- In-Context LearningIn-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Chain of ThoughtChain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into one token prediction.
- Tree of ThoughtTree of Thought extends chain-of-thought by exploring multiple candidate reasoning paths, evaluating intermediate states, and searching over them with strategies such as BFS or DFS. It is useful when solving the task requires branching, backtracking, or comparing alternative partial plans.
- Self-ConsistencySelf-consistency samples multiple reasoning traces for the same problem and chooses the most common final answer rather than trusting a single chain of thought. It often boosts accuracy because different samples make different mistakes, while the correct answer tends to recur.
- ReAct (Reason + Act)ReAct is a prompting pattern where a model alternates between reasoning in text and taking actions such as search or tool calls. This lets it use external information and observations to update its plan instead of reasoning only from the original prompt.
- PretrainingPretraining is the large-scale first stage of training where a model learns general-purpose representations from unlabeled or self-supervised data. For LLMs this usually means next-token prediction over massive corpora, producing a base model that later fine-tuning can adapt.
- Base ModelA base model is the pretrained model before instruction tuning, chat alignment, or task-specific fine-tuning. It is usually optimized only for language modeling, so it can complete text well but may not reliably follow user instructions or safety constraints.
- HallucinationHallucination is when a model produces content that is unsupported or false while presenting it as if it were correct. In language models it often comes from next-token training, weak grounding, or overconfident decoding rather than deliberate deception.
- MisalignmentMisalignment is the failure mode where optimizing a model for its training objective or proxy reward does not produce the behavior humans actually want. It includes problems like reward hacking, unsafe shortcuts, and goal pursuit that diverges from the intended specification.
- Bias (Fairness)In fairness contexts, bias means systematic differences in treatment or error rates across groups caused by data, labels, measurement, or deployment choices. Fairness asks which notion of equal treatment matters, and different fairness criteria often cannot all be satisfied at once.
- ExplainabilityExplainability is the ability to give a human-understandable reason for a model’s prediction or behavior using features, examples, rules, or mechanisms. A good explanation should be useful to a person and, ideally, faithful to what the model actually used.
- Task VectorA task vector is the weight difference between a pre-trained model and the same model after fine-tuning on a task. Adding, subtracting, or scaling that vector can steer behavior, so task vectors provide a simple weight-space tool for editing or combining capabilities.
- Preference-Based AlignmentPreference-based alignment trains models from judgments such as ‘response A is better than response B’ instead of only from supervised targets. It is useful when desired behavior is easier for humans to compare than to specify as a single correct answer.
- Reinforcement Learning from Human Feedback (RLHF)RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize reward while staying close to a reference model. It improved helpfulness and instruction following, but it can also create reward hacking and training instability.
- Reward ModelA reward model predicts a scalar preference score for a candidate response, usually from pairwise human comparisons. In RLHF it acts as a learned proxy objective, so the policy can exploit its mistakes if optimization pushes too hard against it.
- Self-CritiqueSelf-critique is a prompting or training pattern where a model reviews its own draft, identifies problems, and then revises the answer. It can improve reasoning and safety, but only when the model can recognize errors more reliably than it makes them.
- Constitutional AIConstitutional AI aligns a model using an explicit list of principles that guide critique and revision, reducing the need for dense human feedback on every example. The constitution acts like a rule set for self-improvement, though the resulting behavior still depends on the chosen principles and training procedure.
- Direct Preference Optimization (DPO)DPO learns directly from preference pairs by making chosen responses more likely than rejected ones without running a separate RL loop. It can be derived from a KL-constrained reward-maximization view, which is why it is often presented as a simpler alternative to PPO-based RLHF.
- Bradley-Terry ModelThe Bradley-Terry model turns pairwise comparisons into latent scores by assuming the probability that item A beats item B depends on their score difference. It is widely used for preference modeling, ranking, and reward-model training from pairwise judgments.
- Pairwise ComparisonA pairwise comparison asks which of two items is better instead of assigning each item an absolute score. These judgments are often easier and more consistent for humans, which is why they are common in ranking, Elo-style systems, and alignment datasets.
- RankingRanking is the task of ordering items by relevance, preference, or utility rather than predicting a single class label. It appears in search, recommendation, and alignment because the main question is which outputs should be placed above others.
- Elo RatingElo rating estimates skill from pairwise wins and losses by updating each participant’s score based on expected versus actual outcomes. It was designed for chess, but the same logic is used to aggregate model preferences and benchmark head-to-head evaluations.
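The expected-versus-actual update is a one-liner; the standard chess constants (logistic base 10, scale 400, K-factor 32) are used here as conventional defaults:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Update player A's rating given the outcome score_a (1 win, 0.5 draw, 0 loss)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))  # logistic expected score
    return rating_a + k * (score_a - expected_a)

# Equal-rated players: the winner gains k * (1 - 0.5) = 16 points.
print(elo_update(1500, 1500, 1))  # 1516.0
```

Beating a much stronger opponent moves a rating more than beating a weaker one, because the expected score was lower; the same logic applies when aggregating model-vs-model preference wins.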
- Pairwise RankingPairwise ranking learns an ordering from relative preferences between pairs rather than from absolute target values. Many ranking losses optimize the probability that preferred items score above rejected ones, which fits search and alignment data naturally.
- Bootstrap ResamplingBootstrap resampling estimates uncertainty by repeatedly sampling with replacement from an observed dataset and recomputing a statistic on each resample. It is useful when analytic uncertainty formulas are hard to derive, though it assumes the sample is reasonably representative.
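A percentile-bootstrap sketch of the idea, here for a confidence interval on the mean (the `bootstrap_ci` helper, data, and resample count are illustrative choices, not a canonical recipe):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(data) for _ in data])  # resample with replacement, same size
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

data = [2.1, 2.5, 2.9, 3.2, 3.8, 4.0, 4.4, 5.1]
print(bootstrap_ci(data))  # approximate 95% interval for the mean
```

Each resample pretends the observed sample is the population, which is where the "reasonably representative" assumption in the definition bites.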
- Confidence IntervalA confidence interval is a range produced by a procedure that would contain the true parameter a fixed fraction of the time over repeated samples, such as 95%. It quantifies estimation uncertainty, but it is not the probability that the parameter lies in this particular realized interval.
- Floating-Point Operations (FLOPs)FLOPs count the number of floating-point arithmetic operations required by a model or workload. They are a useful compute proxy for comparing training or inference cost, though real speed also depends on memory traffic, parallelism, and hardware utilization.
- Attention MechanismAttention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector.
- Encoder-Decoder ArchitectureAn encoder-decoder architecture uses an encoder to turn an input sequence into representations and a decoder to generate an output sequence conditioned on those representations. It is the standard design for translation, summarization, and other input-to-output generation tasks.
- Sequence-to-Sequence (Seq2Seq)Sequence-to-sequence learning maps one sequence to another, often with different lengths, such as translation or summarization. Modern seq2seq models are usually encoder-decoder Transformers, though earlier versions used recurrent networks with attention.
- Distributed RepresentationA distributed representation stores a concept as a pattern across many features or neurons rather than in a single symbolic slot. This supports similarity, composition, and generalization because related concepts can occupy nearby regions of representation space.
- Representation LearningRepresentation learning is the process of learning useful features automatically from data rather than hand-engineering them. Good representations preserve the structure that downstream tasks need, such as semantic similarity, invariances, or factors of variation.
- Latent SpaceA latent space is the internal feature space in which a model represents inputs after transformation, often in a form that is more compact or task-relevant than raw data. Distances or directions in latent space can encode meaningful variation, but only relative to the model and objective that learned it.
- Embedding SpaceAn embedding space is the vector space produced by an embedding model, where tokens, sentences, images, or other objects are mapped to dense numerical representations. Similarity in that space is used for retrieval, clustering, and transfer, though the geometry depends on the training objective.
- BM25BM25 is a sparse retrieval scoring function that ranks documents using term matches weighted by inverse document frequency and document-length normalization. It remains strong for exact lexical search and is often combined with dense retrieval in hybrid systems.
- GroundingGrounding means tying a model’s answer to external evidence, inputs, or world state rather than letting it generate from unsupported priors alone. In RAG or tool-use systems, grounding is what makes outputs traceable to retrieved context or observations.
- FactualityFactuality is whether the content of an answer is actually true in the world or according to trusted sources. An answer can be fluent and even faithful to its source while still being nonfactual if the source itself is wrong or outdated.
- FaithfulnessFaithfulness is whether a model’s output is supported by the provided input, source document, or chain of evidence. It differs from factuality because a summary can be perfectly faithful to a source that contains false claims.
- CalibrationCalibration measures whether predicted probabilities match observed frequencies, so events predicted at 70% should occur about 70% of the time. A model can be accurate but poorly calibrated if its confidence is systematically too high or too low.
- AdaGradAdaGrad adapts learning rates by dividing each parameter’s update by the square root of the accumulated historical squared gradients. It works especially well for sparse features, but its learning rates can decay too aggressively over long training runs.
- MomentumMomentum accumulates a running velocity of past gradients so updates keep moving in consistent directions and damp noisy zig-zags. It speeds optimization in ravines and is commonly paired with SGD or Nesterov variants.
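The velocity accumulation is easiest to see on a toy scalar (the `sgd_momentum_step` helper and constants are illustrative):

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update: velocity accumulates past gradients."""
    v = beta * v + grad   # running velocity of gradients
    w = w - lr * v        # step along the velocity, not the raw gradient
    return w, v

# Repeated identical gradients accelerate: the velocity grows toward 1/(1-beta).
w, v = 0.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=1.0)
print(round(v, 2))  # 2.71, i.e. 1 + 0.9 + 0.81
```

When successive gradients point the same way the velocity compounds, and when they alternate sign it partially cancels, which is the damping of zig-zags the definition describes.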
- Weight InitializationWeight initialization chooses starting parameter values before training begins. Good initialization keeps activations and gradients in useful ranges so learning can start without vanishing, exploding, or breaking symmetry.
- He Initialization (Kaiming Initialization)He initialization sets weight variance to roughly 2/fan-in so ReLU-like activations preserve signal magnitude through depth. It improves on Xavier initialization for one-sided activations that zero out about half the inputs.
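A sketch of the He normal variant, sampling weights with variance \( 2/\text{fan-in} \) and checking the empirical variance (the `he_init` helper and shapes are illustrative):

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """Sample a weight matrix from N(0, 2/fan_in), the He normal scheme."""
    rng = random.Random(seed)
    std = math.sqrt(2 / fan_in)
    return [[rng.gauss(0, std) for _ in range(fan_out)] for _ in range(fan_in)]

W = he_init(fan_in=512, fan_out=256)
flat = [x for row in W for x in row]
var = sum(x * x for x in flat) / len(flat)
print(round(var, 4))  # empirical variance close to 2/512 ≈ 0.0039
```

The factor of 2 compensates for ReLU zeroing out roughly half the pre-activations, which is exactly what distinguishes it from Xavier initialization.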
- Continual LearningContinual learning is the problem of learning from a sequence of tasks or data distributions without losing previously acquired capabilities. Its core challenge is the stability-plasticity tradeoff: the model must remain adaptable to new data without catastrophically overwriting old knowledge.
- Catastrophic ForgettingCatastrophic forgetting is the sharp loss of performance on previously learned tasks after a model is trained on new ones. It happens because gradient updates that help the new task can overwrite internal representations that were supporting the old task.
- Reward SignalA reward signal is the scalar feedback an RL agent receives about the desirability of its behavior. Because the agent optimizes whatever reward it is given, the design of the reward signal determines whether learning produces the intended behavior or merely exploits a proxy.
- Policy (Reinforcement Learning)In reinforcement learning, a policy is the rule that maps states or observations to actions, often as a probability distribution. Learning a policy means directly improving behavior, and in language-model RL the policy is the model’s distribution over tokens or completions conditioned on context.
- Value FunctionA value function estimates expected future return, either from a state or from a state-action pair under a policy. It matters because it turns delayed rewards into local training signals, enabling planning, bootstrapping, and lower-variance policy gradients.
- InterpretabilityInterpretability is the study of making model behavior understandable to humans, whether by explaining predictions, revealing learned features, or analyzing internal structure. It matters because debugging, trust, scientific understanding, and safety all depend on seeing more than just inputs and outputs.
- Mechanistic InterpretabilityMechanistic interpretability treats a neural network as a system to be reverse-engineered into circuits, features, and algorithms. Its goal is not just to correlate neurons with concepts, but to identify the actual internal computations that produce behavior.
- Attention VisualizationAttention visualization renders attention weights as heatmaps or token-to-token graphs so we can see which positions a model attends to. It is a useful diagnostic tool, but attention weights alone are not a complete explanation of what the model is computing.
- Probing (Neural Networks)Probing tests whether information is encoded in a model’s hidden states by training a simple classifier or regressor on those representations. A successful probe shows that the information is recoverable, but not necessarily that the model causally uses it.
- Logit LensLogit Lens maps intermediate hidden states through the final unembedding matrix to inspect what tokens each layer already appears to favor. It is a convenient way to watch a Transformer’s computation unfold, though it is only approximate because earlier layers were not trained to be decoded directly.
- Activation AnalysisActivation analysis studies the intermediate activations produced during a forward pass rather than only the model’s static weights. By examining which neurons, channels, or directions fire in different contexts, it helps connect internal representations to model behavior.
- Sparse Autoencoder (Mechanistic Interpretability)In mechanistic interpretability, a sparse autoencoder is trained on model activations to decompose dense, superposed representations into a larger set of sparse features. This often makes latent structure more interpretable, because individual learned directions can line up with human-readable concepts or behaviors.
- Superposition (Neural Networks)Superposition is the phenomenon in which a network stores more features than it has obvious dimensions by packing them into overlapping directions. It explains why single neurons can look polysemantic and why sparse feature dictionaries are often more informative than neuron-by-neuron inspection.
- Scaling LawsScaling laws are empirical relationships showing how loss or capability changes with model size, data, and compute, often following approximate power laws. They matter because they let researchers forecast returns to scale and choose more compute-efficient training regimes.
- Emergent CapabilitiesEmergent capabilities in large language models are abilities that look weak or absent at small scale but become strong once the model is large enough. The key caveat is that “emergence” can depend on the metric and threshold used, so apparent jumps are not always literal discontinuities in the underlying capability.
- One-Shot LearningOne-shot learning is the ability to learn or generalize from a single labeled example or demonstration. It matters because many real tasks do not provide large datasets, so the model must infer the rule from minimal evidence.
- Transfer LearningTransfer learning reuses knowledge learned on one task or dataset to improve performance on another. It is effective because useful features learned in a high-resource setting often remain useful in a lower-resource target domain.
- Active LearningActive learning is a training strategy that selectively asks for labels on the most informative unlabeled examples instead of labeling data uniformly at random. Its purpose is to reduce annotation cost by spending human effort where uncertainty or disagreement is highest.
- Neural Language ModelA neural language model predicts text with learned distributed representations and a neural network rather than count tables. Its main advantage over classical n-gram models is that it can generalize to unseen contexts by sharing statistical strength across similar words and patterns.
- Semantic SpaceA semantic space is an embedding space in which geometric relations reflect meaning, similarity, or functional role. Nearby points tend to correspond to semantically related items, which is why vector search and representation learning work at all.
- Logit AdjustmentLogit adjustment means modifying logits to account for effects such as class imbalance, prior shift, or calibration goals before taking probabilities or losses. It changes the decision boundary in a simple way by shifting scores rather than changing the underlying representation.
- Softmax NormalizationSoftmax normalization converts a vector of logits into a probability distribution by exponentiating each score and dividing by the total. It preserves rank order while making outputs comparable, which is why it is the standard final normalization for multiclass prediction.
- Over-ParameterizationOver-parameterization means a model has far more parameters than the minimal number apparently needed to fit the data. Counterintuitively, this often helps optimization and can still generalize well because training dynamics and implicit regularization matter as much as raw parameter count.
- Model EvaluationModel evaluation is the systematic measurement of how a model performs, fails, and trades off across tasks, metrics, and deployment contexts. Good evaluation combines offline benchmarks, stress tests, human judgment, and online metrics rather than relying on a single score.
- Feedback Loop (ML Systems)A feedback loop in an ML system occurs when the model’s outputs change the data it will later train on or be evaluated against. These loops can reinforce bias, distort demand, and make offline metrics look better even while the real system gets worse.
- GRPO (Group Relative Policy Optimization)GRPO is a policy-optimization method that scores sampled responses relative to others in the same group, using those relative rewards to update the policy. Its appeal is that it can improve reasoning performance while avoiding some of the memory overhead of PPO-style critic training.
- Bias MitigationBias mitigation is the set of methods used to reduce unfair or systematically skewed behavior in models and datasets. It can act before training, during optimization, or after deployment, but every intervention trades off fairness goals, accuracy, and operational complexity.
- Steering VectorsSteering vectors are directions in activation space that, when added to hidden states, systematically change model behavior toward traits such as refusal, sentiment, or persona. They are useful because they show that some behaviors can be modified directly in representation space without full retraining.
- Activation PatchingActivation patching is a causal analysis method where activations from one run are inserted into another to test which components matter for a given behavior. If patching a layer or head restores the behavior, that component is evidence for being on the relevant causal path.
- Log-Sum-Exp TrickThe log-sum-exp trick computes expressions like log(sum(exp(x_i))) stably by subtracting the maximum logit before exponentiation. It prevents overflow and underflow, so it is a standard numerical tool in softmax, cross-entropy, and probabilistic inference.
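A minimal NumPy sketch of the trick (the specific logits are illustrative): subtracting the max makes the largest exponent exp(0) = 1, so nothing overflows.

```python
import numpy as np

def logsumexp(x):
    # Subtract the max so the largest term is exp(0) = 1,
    # preventing overflow for large logits and underflow for small ones.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
lse = logsumexp(logits)
# A numerically stable softmax follows directly:
probs = np.exp(logits - lse)
```

The same max-subtraction is what stable softmax and cross-entropy implementations do internally.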
- Exponential Family of DistributionsThe exponential family is the class of distributions that can be written in the form exp(eta^T T(x) - A(eta) + c(x)). This shared form gives them sufficient statistics, convenient conjugate priors, and clean maximum-likelihood geometry, which is why they dominate classical statistical modeling.
- Kolmogorov-Arnold NetworksKolmogorov-Arnold Networks replace fixed scalar weights on edges with learnable one-dimensional functions, so layers are built from sums of learned univariate transforms rather than simple affine maps. They are motivated by the Kolmogorov-Arnold representation theorem and are often discussed as a more interpretable alternative to MLPs, not a universal replacement.
- KL DivergenceKL divergence measures how one probability distribution differs from a reference distribution through an expected log-ratio. It is nonnegative and asymmetric, so it is best understood not as a distance but as the penalty for modeling samples from one distribution with another.
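A small NumPy sketch (the distributions are illustrative) showing the two defining properties, nonnegativity and asymmetry:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i); requires q_i > 0 wherever p_i > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.4, 0.1])
# kl(p, q) and kl(q, p) are both positive but generally unequal:
forward, reverse = kl(p, q), kl(q, p)
```

The asymmetry is why "penalty for modeling p with q" is a better mental model than "distance between p and q".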
- Jensen's InequalityJensen’s inequality says that for a convex function f, applying f after taking an expectation gives a value no larger than taking the expectation after applying f. This one fact underlies many core results in ML, including the nonnegativity of KL divergence and the derivation of variational lower bounds.
- Bayes' TheoremBayes’ theorem updates beliefs by combining a prior with the likelihood of observed evidence to produce a posterior. In compact form, posterior is proportional to likelihood times prior, which is why Bayesian inference is fundamentally a rule for disciplined belief revision.
- Expectation–Maximization (EM) AlgorithmThe EM algorithm is an iterative method for maximum-likelihood or MAP estimation in models with latent variables. Each round first estimates the hidden structure under the current parameters and then re-optimizes the parameters as if that hidden structure were known.
- Eigenvalues and EigenvectorsFor a matrix A, an eigenvector is a nonzero direction that A only rescales, and the scaling factor is its eigenvalue. Eigenpairs matter because they reveal invariant directions, control stability, and make problems like PCA and spectral clustering possible.
- Jacobian and HessianThe Jacobian collects first-order partial derivatives of a vector-valued function, while the Hessian collects second-order partial derivatives of a scalar function. Together they describe local sensitivity and curvature, which is why they are central to optimization and dynamical analysis.
- Multivariate Gaussian DistributionThe multivariate Gaussian is the vector-valued generalization of the normal distribution, parameterized by a mean vector and covariance matrix. It is foundational because linear transformations, marginals, and conditionals all stay Gaussian, making analysis and inference unusually tractable.
- Naive Bayes ClassifierNaive Bayes is a probabilistic classifier that applies Bayes’ theorem under the simplifying assumption that features are conditionally independent given the class. That assumption is usually false, but the model is still fast, data-efficient, and surprisingly effective for sparse text problems.
- Universal Approximation TheoremThe universal approximation theorem says that a sufficiently wide neural network with a suitable nonlinearity can approximate any continuous function on a compact domain arbitrarily well. It is an existence result, not a guarantee that training will find that approximation efficiently.
- Double DescentDouble descent is the phenomenon in which test error first follows the classical U-shape with increasing model size, then improves again once the model passes the interpolation threshold. It matters because it shows that the old bias-variance story is incomplete in highly overparameterized regimes.
- GrokkingGrokking is a delayed generalization phenomenon in which a model first memorizes the training set and only much later snaps into a simple algorithm that generalizes well. It is interesting because the model already had enough capacity to fit the data, yet the more general solution emerged only after long training and regularization pressure.
- Neural Tangent Kernel (NTK)The Neural Tangent Kernel is the kernel that describes how an infinitely wide network trained by small gradient steps evolves around its initialization. In that limit, training becomes equivalent to kernel regression, which explains part of the behavior of very wide networks.
- Linear AttentionLinear attention is the family of attention mechanisms that rewrites or approximates softmax attention so sequence processing scales roughly linearly instead of quadratically with length. The benefit is efficiency on long contexts, but the tradeoff is that exact softmax behavior is usually lost.
- Chinchilla Scaling LawsChinchilla scaling laws showed that many large language models were undertrained for their size under fixed compute budgets. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, yielding better compute-optimal performance.
- Muon OptimizerMuon is an optimizer designed especially for matrix-valued parameters that replaces the raw update direction with an orthogonalized one. The point is to respect matrix structure rather than treating every weight tensor as a flattened vector, with the goal of improving training efficiency relative to standard first-order optimizers.
- REINFORCE AlgorithmREINFORCE is the basic Monte Carlo policy-gradient algorithm that updates parameters by weighting the log-probability of sampled actions by their returns. It is unbiased, but its variance is high, which is why practical methods usually add baselines or critics.
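A toy NumPy sketch of the update on a two-armed bandit (the arm rewards, learning rate, and step count are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rewards = np.array([1.0, 0.2])  # deterministic reward per arm (toy setup)
theta = np.zeros(2)
lr = 0.2

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = rewards[a]
    # grad of log pi(a) for a softmax policy is onehot(a) - pi
    grad = -pi
    grad[a] += 1.0
    theta += lr * r * grad  # REINFORCE: score function weighted by return

pi = softmax(theta)  # ends up concentrated on the better arm
```

Even on this toy problem the raw updates are noisy; subtracting a baseline (e.g. a running mean reward) is the standard variance reduction the blurb alludes to.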
- Generalized Advantage Estimation (GAE)Generalized Advantage Estimation is a family of advantage estimators that interpolates between low-variance temporal-difference updates and high-variance Monte Carlo returns using a parameter lambda. It is widely used because it gives a practical bias-variance tradeoff for policy-gradient training.
- Actor–Critic MethodsActor-critic methods learn a policy and a value estimator at the same time. The actor chooses actions, while the critic estimates return and supplies a lower-variance learning signal than raw returns alone.
- Process Reward Models (PRM) vs Outcome Reward Models (ORM)Outcome reward models score only the final answer, while process reward models score the intermediate steps of a solution. PRMs provide denser supervision and better guidance for search and long-form reasoning, but they require more fine-grained labels and more complex evaluation.
- KTO (Kahneman–Tversky Optimization)KTO is a preference-optimization objective that learns from binary desirable-versus-undesirable labels instead of pairwise rankings. It uses a utility formulation inspired by prospect theory, making it a cheaper alternative when collecting full preference comparisons is too expensive.
- Process SupervisionProcess supervision trains a model on the quality of intermediate reasoning steps rather than only on whether the final answer is correct. It improves credit assignment for long solutions and makes verification more local, though collecting reliable step-level labels is expensive.
- Monte Carlo Tree Search for LLM ReasoningMonte Carlo Tree Search for LLM reasoning treats partial solution paths as tree nodes, expands candidate continuations, and uses rollouts or value estimates to decide where to search next. It is attractive because it turns one-shot generation into guided search over reasoning trajectories instead of committing immediately to a single chain of thought.
- Self-Refine / ReflexionTwo closely related inference-time techniques in which an LLM critiques and revises its own output over multiple rounds. Self-Refine uses the same model in three roles (generate → feedback → refine); Reflexion adds an episodic memory of past failures to guide future trajectories in agentic tasks.
- Attention Sinks / StreamingLLMAttention sinks are the first few tokens in a causal Transformer that absorb disproportionate attention from later positions, even when they carry little semantic content. StreamingLLM exploits this by keeping sink tokens and a short recent window in the KV cache, enabling long streaming inference with bounded memory.
- Induction HeadsA specific two-head circuit in Transformer attention that copies the next token after a previous occurrence of the current token — the computational basis for in-context learning. Anthropic showed induction heads form suddenly during training, coinciding with the sharp jump in ICL ability.
- Circuit AnalysisThe mechanistic-interpretability practice of identifying subgraphs of weights, residual-stream components, and attention heads that jointly implement a human-interpretable algorithm (indirect object identification, modular addition, greater-than). Circuit analysis produces falsifiable, causal accounts of what a network has learned.
- ROME / MEMIT Model EditingRank-one edits to MLP weights that inject a single fact (ROME) or thousands of facts (MEMIT) into a pretrained LLM without retraining. They exploit the observation that MLP blocks act as key–value memories, identify the causal neurons via activation patching, and solve a closed-form optimisation problem for the minimal-norm weight update.
- Denoising Diffusion Probabilistic Models (DDPM)A generative model that learns to reverse a fixed Gaussian corruption process. Ho et al. (2020) showed that predicting the added noise with a neural network, trained by a simple MSE loss on \( T \) diffusion steps, yields state-of-the-art image synthesis — the foundation of all modern image/video diffusion.
- Classifier-Free GuidanceClassifier-free guidance is a sampling trick for conditional diffusion models that combines conditional and unconditional predictions to push samples harder toward the prompt. It improves prompt adherence without a separate classifier, but too much guidance can oversaturate images and reduce diversity.
- Flow Matching / Rectified FlowA generative modelling framework that learns a time-dependent velocity field mapping noise to data along a fixed probability path. Rectified Flow in particular learns straight-line paths between noise and data samples, enabling 1–4 step sampling with quality matching much deeper diffusion models.
- Chain Rule of ProbabilityThe factorisation \( p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid x_{<i}) \) that decomposes any joint distribution into a product of conditionals. It is the mathematical bedrock of autoregressive language models, belief networks, and most tractable density estimation.
- Central Limit TheoremGiven i.i.d. samples \( X_1, \dots, X_n \) with finite mean \( \mu \) and variance \( \sigma^2 \), the standardised sample mean \( \sqrt{n}(\bar X_n - \mu)/\sigma \) converges in distribution to \( \mathcal{N}(0,1) \). The CLT underlies confidence intervals, stochastic-gradient noise analysis, and many initialisation arguments.
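A quick NumPy simulation of the statement (sample sizes here are arbitrary): standardised means of uniform draws land close to a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 500, 20000
mu, sigma = 0.5, np.sqrt(1 / 12)      # mean and std of Uniform(0, 1)
x = rng.random((trials, n))
# sqrt(n) * (sample mean - mu) / sigma should be approximately N(0, 1)
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma
```

The standardisation makes the mean exactly 0 and the variance exactly 1; the CLT is what makes the whole distribution approximately Gaussian.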
- Variance, Covariance and CorrelationVariance measures the spread of a single random variable; covariance measures joint variation of two; correlation normalises covariance to \( [-1, 1] \) and is scale-free. Together they form the second-order statistics that drive PCA, linear regression, Kalman filters, and most initialisation schemes.
- Taylor Series ExpansionApproximates a smooth function near a point \( a \) by a polynomial whose coefficients are the function's derivatives: \( f(x) \approx \sum_{k=0}^{K} \frac{f^{(k)}(a)}{k!}(x-a)^k \). Provides the theoretical scaffolding for gradient descent (1st order), Newton's method (2nd order), Laplace approximations, and loss-landscape analysis.
- Entropy vs Cross-Entropy vs KL (Unified View)A single identity \( H(p,q) = H(p) + D_{\text{KL}}(p \| q) \) ties the three together: entropy is the optimal code length under \( p \); cross-entropy is the code length using \( q \); KL is the extra cost of the mismatch. This view clarifies why minimising classification cross-entropy is equivalent to MLE and to minimising KL.
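The identity can be checked numerically in a few lines (the two distributions are illustrative):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

H_p  = -np.sum(p * np.log(p))      # entropy: optimal code length under p
H_pq = -np.sum(p * np.log(q))      # cross-entropy: code length using q
KL   =  np.sum(p * np.log(p / q))  # KL: extra cost of the mismatch
# H(p, q) = H(p) + D_KL(p || q) holds exactly
```

Since H(p) does not depend on q, minimising cross-entropy in q is exactly minimising KL, which is the MLE connection the blurb mentions.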
- Mahalanobis DistanceThe metric \( d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})} \) measures distance in a space whitened by the covariance \( \Sigma \). It is the natural distance for Gaussian data, the log-likelihood core of a multivariate Gaussian, and the basis of Fisher's discriminant analysis.
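A minimal NumPy sketch (the covariance is an illustrative choice): two points at equal Euclidean distance get different Mahalanobis distances once the whitening by \( \Sigma \) is applied.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    d = x - mu
    # solve() instead of an explicit inverse for numerical stability
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],   # x-axis has 4x the variance of the y-axis
                  [0.0, 1.0]])

d_x = mahalanobis(np.array([2.0, 0.0]), mu, Sigma)  # 2 / sqrt(4) = 1.0
d_y = mahalanobis(np.array([0.0, 2.0]), mu, Sigma)  # 2 / sqrt(1) = 2.0
```

The high-variance direction is "cheap" to move along, which is exactly the behaviour a Gaussian log-likelihood exhibits.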
- Conjugate PriorsA prior is conjugate to a likelihood family if the posterior belongs to the same family as the prior. Conjugacy turns Bayesian updating into a closed-form parameter update and underlies analytical treatments of Beta–Binomial, Dirichlet–Multinomial, and Normal–Normal models.
- Gaussian Mixture Models (GMM)A density model \( p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k) \) that represents data as a weighted sum of Gaussian components. Fitted by the EM algorithm, GMMs are the canonical example of latent-variable density estimation and the statistical cousin of \( k \)-means.
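A compact NumPy sketch of EM for a two-component, unit-variance 1-D mixture (the data, initialisation, and fixed variances are illustrative simplifications, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian clusters
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(5.0, 1.0, 600)])

pi_ = np.array([0.5, 0.5])   # mixture weights
mu  = np.array([1.0, 4.0])   # component means (rough initial guesses)

for _ in range(50):
    # E-step: responsibilities r[n, k] proportional to pi_k * N(x_n; mu_k, 1)
    log_r = np.log(pi_) - 0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights and means from the soft counts
    nk = r.sum(axis=0)
    pi_ = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
```

A full GMM would also re-estimate per-component variances in the M-step; they are held at 1 here for brevity.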
- Variational Inference / ELBOA framework that turns Bayesian inference into optimisation: choose a tractable family \( q(\mathbf{z}; \phi) \) and maximise the evidence lower bound \( \mathcal{L}(\phi) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \), which simultaneously approximates the posterior and bounds the log-evidence. VAEs, variational Bayes, and amortised inference all descend from this objective.
- Evidence Lower Bound (ELBO) — DerivationTwo derivations of the ELBO: (i) Jensen's inequality applied to \( \log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z} \), and (ii) the identity \( \log p(\mathbf{x}) = \text{ELBO}(q) + D_{\text{KL}}(q \| p(\cdot \mid \mathbf{x})) \). Both produce the same bound, but the second makes the gap explicit.
- Score MatchingAn estimation principle (Hyvärinen, 2005) that fits an unnormalised density by matching the model's score \( \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) to the data's score. Integration-by-parts eliminates the unknown data-score, yielding a tractable objective that underlies modern score-based diffusion models.
- Denoising Score MatchingVincent (2011) showed that the score of a Gaussian-corrupted data distribution \( q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \) admits a closed-form target, reducing score learning to a simple regression: predict \( (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). This identity is the algorithmic heart of modern diffusion models.
- Feature Attribution (Integrated Gradients, SHAP)Post-hoc interpretability methods that attribute a model's prediction to its input features. Integrated Gradients (Sundararajan et al., 2017) integrates gradients along a path from a baseline to the input. SHAP (Lundberg & Lee, 2017) builds on Shapley values from cooperative game theory, giving a unique attribution with axiomatic fairness properties.
- Monosemanticity and SAE FeaturesMonosemanticity is the idea that a single learned feature corresponds to a single interpretable concept rather than many unrelated ones. Sparse autoencoders are used to decompose dense neural activations into a wider sparse basis where such features are easier to identify.
- Matrix Rank and the Rank–Nullity TheoremThe rank of \( A \in \mathbb{R}^{m \times n} \) is the dimension of its column space, equal to the dimension of its row space. The rank–nullity theorem states \( \text{rank}(A) + \text{nullity}(A) = n \), linking how much a linear map preserves to how much it collapses — the foundation for invertibility conditions, low-rank adaptation, and OLS identifiability.
- Matrix Norms: Frobenius, Spectral, NuclearThe three dominant matrix norms measure different notions of size: Frobenius \( \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2} \) is the entrywise \( \ell_2 \); spectral (operator) \( \|A\|_2 = \sigma_{\max}(A) \) is the largest gain on unit vectors; nuclear \( \|A\|_* = \sum_i \sigma_i(A) \) is the convex surrogate for rank. Each appears in a distinct ML context: regularisation, Lipschitz bounds, and low-rank recovery respectively.
- Positive (Semi-)Definite MatricesA symmetric matrix \( A \in \mathbb{R}^{n \times n} \) is positive semi-definite (PSD) if \( x^\top A x \ge 0 \) for all \( x \), and positive definite (PD) if strict for \( x \ne 0 \). Equivalent characterisations include non-negative eigenvalues and a Cholesky factorisation \( A = L L^\top \). PSD structure underlies covariance matrices, Gram matrices, convex Hessians, and every kernel method.
- Moore–Penrose PseudoinverseFor any \( A \in \mathbb{R}^{m \times n} \) with SVD \( A = U\Sigma V^\top \), the Moore–Penrose pseudoinverse is \( A^+ = V\Sigma^+ U^\top \), where \( \Sigma^+ \) inverts the non-zero singular values and transposes. \( A^+ b \) is the minimum-norm least-squares solution to \( Ax = b \) — the canonical generalisation of \( A^{-1} \) for rectangular and rank-deficient matrices.
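A small NumPy sketch of the minimum-norm property on an underdetermined system (the matrix shape is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# More columns than rows: Ax = b has infinitely many solutions
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)

x = np.linalg.pinv(A) @ b  # the minimum-norm least-squares solution
```

`np.linalg.lstsq` returns the same minimum-norm solution; for full-rank square matrices the pseudoinverse simply coincides with the ordinary inverse.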
- QR DecompositionAny \( A \in \mathbb{R}^{m \times n} \) with \( m \ge n \) factors as \( A = QR \), with \( Q \in \mathbb{R}^{m \times n} \) having orthonormal columns and \( R \in \mathbb{R}^{n \times n} \) upper triangular. It is the numerically preferred route to OLS, avoids squaring the condition number, and is the work-horse of least-squares solvers in every major numerical library.
- Cholesky DecompositionEvery symmetric positive-definite matrix \( A \) factors uniquely as \( A = LL^\top \) with \( L \) lower triangular with positive diagonal. Cholesky is twice as fast as LU, numerically stable without pivoting, and the standard building block for Gaussian-process inference, Kalman filters, and sampling from multivariate normals.
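A minimal NumPy sketch of the sampling use case (the mean and covariance are illustrative): transform standard normal draws by the Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])       # symmetric positive definite

L = np.linalg.cholesky(Sigma)        # Sigma = L @ L.T
z = rng.standard_normal((50000, 2))  # z ~ N(0, I)
samples = mu + z @ L.T               # samples ~ N(mu, Sigma)
```

This works because an affine map of a Gaussian is Gaussian with covariance \( L I L^\top = \Sigma \) — the same closure property highlighted in the multivariate Gaussian entry.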
- Matrix Calculus and DifferentialsMatrix calculus expresses derivatives of scalar, vector, and matrix-valued functions of matrix inputs in compact form, avoiding entrywise index chasing. Working with the differential \( dL = \text{tr}(G^\top dX) \) identifies \( \nabla_X L = G \) directly and turns backpropagation into linear algebra.
- Convex Optimization FundamentalsA convex optimization problem minimises a convex function over a convex feasible set. Its defining property: every local minimum is global. Convexity also gives polynomial-time algorithms with strong guarantees — the backdrop against which non-convex deep learning is understood as a deliberate departure.
- KKT Conditions and Lagrangian DualityThe Karush–Kuhn–Tucker (KKT) conditions generalise \( \nabla f = 0 \) to constrained optimisation, giving necessary (and, for convex problems, sufficient) conditions for a minimum. Lagrangian duality transforms the constrained primal into a dual over multipliers; for convex problems with Slater's condition, the duality gap closes — the workhorse derivation behind SVMs, interior-point methods, and much of RL.
- Newton's Method and Quasi-Newton (L-BFGS)Newton's method uses the second-order model \( f(x+\Delta) \approx f(x) + g^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta \) to take the step \( \Delta = -H^{-1} g \). It converges quadratically near a minimum but is impractical in high dimensions. Quasi-Newton methods (BFGS, L-BFGS) approximate \( H^{-1} \) from gradient histories, retaining super-linear convergence while avoiding the explicit Hessian formation and solve that make Newton's method expensive.
- Proximal Gradient Methods (ISTA / FISTA)Proximal gradient methods solve objectives of the form \(f(x)+g(x)\) where \(f\) is smooth and \(g\) is nonsmooth but has an easy proximal operator. Each step does gradient descent on \(f\) and then a shrinkage- or projection-like update for \(g\), which is why the method underlies Lasso, sparse coding, and constrained convex optimization.
- Conjugate Gradient MethodConjugate gradient (CG) solves \( Ax = b \) for symmetric PD \( A \) using only matrix-vector products, converging in at most \( n \) steps in exact arithmetic and much faster when the eigenvalue spectrum is clustered. CG underlies truncated-Newton optimisation, the natural-gradient in K-FAC, and large-scale Gaussian-process inference.
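A textbook NumPy implementation (the test matrix is an illustrative SPD construction): note that `A` appears only inside matrix-vector products, which is what makes CG viable at scale.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    # Solve Ax = b for symmetric positive-definite A using only A @ v products.
    x = np.zeros_like(b)
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)  # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new direction, A-conjugate to the old ones
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)        # well-conditioned SPD matrix
b = rng.standard_normal(6)
x = conjugate_gradient(A, b)
```

In production settings `A @ p` would be replaced by a matrix-free operator (e.g. a Hessian-vector product), which is how truncated-Newton methods use CG.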
- Random Variables, Expectation, and Variance (Axiomatic)A random variable \( X \) is a measurable map from a probability space \( (\Omega, \mathcal{F}, P) \) to \( \mathbb{R} \). Its expectation \( \mathbb{E}[X] = \int X\,dP \) and variance \( \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \) are the first two moments. Beyond textbook formulas, this axiomatic view explains why expectations linearise sums, exist iff \( \mathbb{E}|X| < \infty \), and commute with limits under uniform integrability.
- Bernoulli, Binomial, Categorical, and Multinomial DistributionsThe four atomic discrete distributions: Bernoulli \( (p) \) for a single binary trial, Binomial \( (n, p) \) for a sum of \( n \) such trials, Categorical \( (\pi) \) for a single \( K \)-class outcome (softmax target), and Multinomial \( (n, \pi) \) for counts across \( n \) such trials. They form the likelihood backbone of logistic regression, cross-entropy training, and count models.
- Beta and Dirichlet DistributionsThe Beta \( (\alpha, \beta) \) is the conjugate prior for Bernoulli/Binomial likelihoods; the Dirichlet \( (\boldsymbol{\alpha}) \) is its \( K \)-class generalisation, conjugate to Categorical/Multinomial. Both live on probability simplices and make Bayesian updating a matter of adding counts to hyperparameters — the cleanest introduction to conjugate Bayesian inference and the scaffolding for LDA.
- Poisson and Exponential DistributionsPoisson \( (\lambda) \) models counts of independent events in a fixed interval; Exponential \( (\lambda) \) models the waiting time between them. They are the two marginals of the Poisson process — one discrete, one continuous — and between them cover rare-event counts, waiting times, and the building blocks of survival analysis, rate modelling, and queueing theory.
- Mutual Information and Conditional EntropyConditional entropy \( H(Y \mid X) \) measures the residual uncertainty in \( Y \) given \( X \). Mutual information \( I(X; Y) = H(Y) - H(Y \mid X) \) measures how much \( X \) reduces the uncertainty in \( Y \) — equivalently, the KL divergence of the joint \( p(X,Y) \) from the product of marginals \( p(X)p(Y) \). Together they quantify dependence, drive information-bottleneck theory, and define decision-tree splits.
- Hypothesis Testing, p-values, and Statistical PowerA hypothesis test compares a null \( H_0 \) to an alternative \( H_1 \) by computing a test statistic and its tail probability under \( H_0 \) — the p-value. Statistical power is \( 1 - \beta \), the probability of rejecting \( H_0 \) when \( H_1 \) is true. For ML evaluation, these are the tools that separate "this model is better" from "this split was lucky".
- Cramér–Rao Lower BoundThe Cramér–Rao bound states that any unbiased estimator \( \hat\theta \) of a parameter \( \theta \) has variance \( \text{Var}(\hat\theta) \ge 1/I(\theta) \), where \( I(\theta) \) is the Fisher information. It is the foundational efficiency bound of classical statistics and gives an immediate lower bound on the uncertainty of MLE-based procedures.
- Gaussian Process RegressionA Gaussian process defines a distribution over functions such that any finite set of evaluations is jointly Gaussian. Given a kernel \( k \) and noisy observations, the posterior at a test point is \( \mathcal{N}(\mu_*, \sigma_*^2) \) with closed-form mean and variance. GP regression gives both predictions and calibrated uncertainty, costs \( O(n^3) \) for \( n \) training points, and is the Bayesian counterpart of kernel ridge regression.
- VC Dimension and ShatteringThe Vapnik–Chervonenkis dimension of a hypothesis class \( \mathcal{H} \) is the largest number of points \( \mathcal{H} \) can shatter — label in every possible way. It controls PAC generalisation: if \( \text{VC}(\mathcal{H}) = d \), then with probability \( 1 - \delta \) all \( h \in \mathcal{H} \) satisfy \( L(h) \le \hat L(h) + O(\sqrt{d/n}) \). VC dimension explains why linear classifiers in \( \mathbb{R}^d \) need \( \Omega(d) \) examples and why simple hypothesis classes generalise.
- Rademacher ComplexityThe empirical Rademacher complexity of a function class \( \mathcal{F} \) on data \( S = (z_1, \dots, z_n) \) is \( \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma[\sup_{f \in \mathcal{F}} \tfrac{1}{n}\sum_i \sigma_i f(z_i)] \) — the expected ability of \( \mathcal{F} \) to correlate with random \( \pm 1 \) signs. It is the data-dependent workhorse of modern generalisation bounds, usually tighter than VC, and gives direct norm-based bounds for deep networks.
- No-Free-Lunch TheoremThe No-Free-Lunch theorem says that averaged uniformly over all possible target functions, no learning algorithm outperforms any other. In machine learning it means performance gains must come from inductive bias that matches the structure of the problems we actually care about.
- Linear and Quadratic Discriminant Analysis (LDA / QDA)LDA and QDA are generative classifiers: model class-conditionals \( p(x \mid y = k) = \mathcal{N}(\mu_k, \Sigma_k) \) and apply Bayes' rule. LDA assumes shared covariance \( \Sigma \) across classes, giving linear decision boundaries and shrinkage-like regularisation; QDA uses per-class \( \Sigma_k \), giving quadratic boundaries at the cost of \( O(K d^2) \) covariance parameters. Both are the generative counterparts of logistic regression.
- Lottery Ticket HypothesisA dense randomly-initialised neural network contains subnetworks ("winning tickets") that — when trained in isolation with their original initialisation — match the full network's accuracy in the same number of steps. This Frankle–Carbin observation motivates one-shot and iterative magnitude pruning as search algorithms for sparse trainable subnetworks, reframing pruning as an initialisation search rather than a post-hoc compression.
- Loss Landscape: Flat vs Sharp MinimaFlat minima (low curvature / small Hessian eigenvalues) generalise better than sharp minima (high curvature), empirically and via PAC-Bayes bounds. SGD's noise, large batch sizes, and the Sharpness-Aware Minimisation (SAM) optimiser all interact with this: small-batch SGD prefers flat minima, large-batch SGD falls into sharper ones, and SAM explicitly penalises sharpness during training.
- Implicit Regularisation of SGDOver-parameterised networks trained by SGD generalise despite being able to fit pure noise — SGD's trajectory biases the solution toward specific minima. For linear models, gradient flow converges to the minimum-norm interpolator; for deep nets, SGD with small LR and moderate batch behaves like Bayesian inference with an implicit prior on flat minima. This implicit bias is why modern deep learning does not need explicit capacity control.
- Multi-Armed Bandits: ε-Greedy, UCB, Thompson SamplingA multi-armed bandit is the simplest setting for exploration vs exploitation: \( K \) arms, each with unknown reward distribution; pull one per step; minimise cumulative regret. The three canonical strategies — ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling — illustrate optimism, confidence-based exploration, and probability matching respectively, and generalise to contextual bandits and full RL.
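A toy NumPy sketch of ε-greedy, the simplest of the three strategies (arm means, noise level, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # arm 2 is best

def eps_greedy(steps=5000, eps=0.1):
    counts = np.zeros(3)
    values = np.zeros(3)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(3))       # explore uniformly
        else:
            a = int(np.argmax(values))     # exploit the current best estimate
        r = rng.normal(true_means[a], 0.1)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
    return counts, values

counts, values = eps_greedy()
```

UCB replaces the ε coin-flip with a confidence bonus added to `values`, and Thompson sampling samples arm means from a posterior instead of using point estimates.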
- Temporal-Difference Learning: TD(0), SARSA, TD(λ)Temporal-difference learning updates value estimates using the Bellman bootstrapped target \( R_t + \gamma V(S_{t+1}) \) rather than a full Monte Carlo return. TD(0) is the one-step instance; SARSA extends it on-policy to action-values; TD(λ) interpolates between TD(0) and Monte Carlo via eligibility traces. TD is the central learning rule behind Q-learning, DQN, and actor-critic.
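A TD(0) sketch on a classic 5-state random-walk MRP (the chain, step size, and episode count are illustrative; the true values are \( V(s) = s/6 \), the probability of exiting on the right):

```python
import random

random.seed(0)
# States 1..5, terminals 0 and 6; reward 1 only on the right exit.
V = {s: 0.5 for s in range(1, 6)}
alpha, gamma = 0.05, 1.0

for _ in range(10000):
    s = 3  # each episode starts in the middle
    while 1 <= s <= 5:
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        v_next = V.get(s_next, 0.0)  # terminal states have value 0
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * v_next - V[s])
        s = s_next
```

Unlike Monte Carlo, each update happens mid-episode from the bootstrapped target, which is the bias-for-variance trade that TD(λ) then interpolates.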
- Policy Gradient TheoremFor a stochastic policy \( \pi_\theta(a \mid s) \), the gradient of expected return \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] \) is \( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(A \mid S) \cdot Q^{\pi_\theta}(S, A)] \). No gradient flows through the environment dynamics — the theorem turns RL into a stochastic optimisation over policy parameters. It is the foundation of REINFORCE, actor-critic, PPO, TRPO, and GRPO.
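A minimal REINFORCE sketch makes the "no gradient through the environment" point concrete — on a toy two-armed bandit with deterministic rewards 0 and 1, only the score function \( \nabla_\theta \log \pi_\theta \) is differentiated, and the reward enters as a scalar weight (all constants here are illustrative):

```python
import numpy as np

def reinforce_bandit(steps=500, lr=0.1, seed=0):
    """REINFORCE with a softmax policy on a 2-armed bandit (sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    rewards = np.array([0.0, 1.0])
    for _ in range(steps):
        probs = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, p=probs)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0               # gradient of log softmax at action a
        theta += lr * grad_log_pi * rewards[a]   # reward is just a scalar weight
    return np.exp(theta) / np.exp(theta).sum()

probs = reinforce_bandit()   # policy shifts mass onto the rewarding arm
```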
- Trust Region Policy Optimization (TRPO)TRPO performs policy updates by solving a constrained optimisation: maximise a surrogate advantage subject to \( \mathbb{E}[D_{\text{KL}}(\pi_\text{old} \| \pi_\theta)] \le \delta \). The KL trust region gives a monotonic-improvement guarantee and prevents collapse under function approximation. TRPO is solved by natural-gradient + line search; PPO is its first-order clipped approximation with near-identical performance at much lower cost.
- Active Inference and the Free-Energy PrincipleFriston's free-energy principle frames perception, learning, and action as minimising a single quantity — variational free energy, a KL-plus-accuracy bound on surprise. Active inference extends the principle to behaviour: agents select actions that minimise expected free energy, simultaneously seeking reward and information. It is a generative-model-based alternative to classical RL with tight links to variational inference, ELBO, and exploration–exploitation.
- Wasserstein Distance & Optimal TransportThe \( p \)-Wasserstein distance \( W_p(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \big( \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|^p \big)^{1/p} \) measures the minimum cost of reshaping distribution \( \mu \) into \( \nu \). It underpins WGAN, flow matching, and a whole family of divergences that remain well-behaved when KL blows up.
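In one dimension the optimal transport plan between two equal-size empirical distributions simply matches sorted samples, which gives a tiny exact \( W_1 \) sketch (toy inputs):

```python
import numpy as np

def wasserstein_1d(x, y):
    """1-D p=1 Wasserstein distance between equal-size empirical measures:
    the optimal coupling pairs the sorted samples (sketch)."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(x - y)))

d = wasserstein_1d([0.0, 1.0], [1.0, 2.0])   # shifting all mass by 1 costs 1
```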
- f-Divergences (Unified View)For any convex \( f \) with \( f(1) = 0 \), the \( f \)-divergence \( D_f(P \| Q) = \mathbb{E}_Q[f(dP/dQ)] \) recovers KL (\( f = t \log t \)), reverse KL, Jensen–Shannon, total variation, \( \chi^2 \), Hellinger, and α-divergences as special cases. The variational (Fenchel) form underlies f-GAN and density-ratio estimation.
- Itô Calculus & Stochastic Differential EquationsItô calculus extends ordinary calculus to processes driven by Brownian motion. An SDE \( dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t \) combines a drift and a diffusion term; Itô's lemma replaces the chain rule. This is the mathematical substrate of score-based diffusion models, flow matching, and neural SDEs.
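The simplest way to compute with an SDE is the Euler–Maruyama discretisation \( X_{t+\Delta t} = X_t + \mu \Delta t + \sigma \sqrt{\Delta t}\, \xi \). A sketch on the Ornstein–Uhlenbeck process \( dX = -\theta X\,dt + \sigma\,dW \) (all parameters toy choices), whose stationary variance is \( \sigma^2 / 2\theta \):

```python
import numpy as np

def euler_maruyama_ou(theta=1.0, sigma=1.0, dt=0.01, T=10.0, n_paths=5000, seed=0):
    """Euler-Maruyama for the OU SDE dX = -theta*X dt + sigma dW (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_paths)
    for _ in range(int(T / dt)):
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)   # Brownian increment
        x = x + (-theta * x) * dt + sigma * dw        # drift + diffusion
    return x

x = euler_maruyama_ou()
stationary_var = float(np.var(x))   # should be near sigma^2/(2*theta) = 0.5
```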
- Fokker–Planck & Probability-Flow ODEThe Fokker–Planck equation \( \partial_t p_t = -\nabla \cdot (f p_t) + \tfrac{1}{2} \nabla^2 : (g g^\top p_t) \) governs how the density of an SDE-driven process evolves. The probability-flow ODE shares these exact marginals with a deterministic vector field, enabling DDIM-style deterministic sampling and likelihood computation.
- Reproducing Kernel Hilbert Spaces (RKHS) & the Representer TheoremAn RKHS is a Hilbert space of functions where evaluation at a point can be written as an inner product with a kernel function. The representer theorem says that many regularized empirical-risk problems in an RKHS have solutions that are finite sums of kernel evaluations at the training points, making kernel methods practical.
- Mercer's Theorem & Kernel Feature MapsMercer's theorem shows that a continuous positive-semidefinite kernel can be expanded in nonnegative eigenvalues and orthogonal eigenfunctions. This makes kernel functions equivalent to inner products in a possibly infinite-dimensional feature space and motivates the kernel trick.
- Natural Gradient & Fisher–Rao GeometryThe natural gradient \( \tilde\nabla_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L} \) preconditions the Euclidean gradient by the inverse Fisher information matrix, yielding steepest descent under the KL-divergence metric on the statistical manifold. It underlies K-FAC, Shampoo, TRPO's trust region, and the original motivation for reparameterisation-invariant optimisation.
- Hamiltonian Monte Carlo (HMC) & NUTSHMC augments the target distribution with an auxiliary momentum variable and simulates Hamiltonian dynamics so that proposals move long distances while staying on near-constant-energy shells. It explores complex posteriors dramatically faster than random-walk Metropolis; the No-U-Turn Sampler (NUTS) adapts the integration length automatically.
- Metropolis–Hastings AlgorithmThe canonical MCMC recipe: propose \( \theta' \sim q(\theta' \mid \theta) \), accept with probability \( \min(1, \pi(\theta')q(\theta\mid\theta')/[\pi(\theta)q(\theta'\mid\theta)]) \). Produces a Markov chain with stationary distribution \( \pi \) for any valid proposal, turning intractable posterior sampling into a correctness-guaranteed iterative procedure.
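A random-walk Metropolis sketch (symmetric proposal, so the \( q \) terms cancel in the acceptance ratio), targeting a standard normal via its unnormalised log-density — proposal scale and chain length are illustrative:

```python
import numpy as np

def metropolis_hastings(log_pi, n=20000, step=1.0, seed=0):
    """Random-walk Metropolis sampler (sketch); log_pi may be unnormalised."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    samples = np.empty(n)
    for i in range(n):
        prop = theta + step * rng.normal()
        # accept with probability min(1, pi(prop)/pi(theta)), in log space
        if np.log(rng.random()) < log_pi(prop) - log_pi(theta):
            theta = prop
        samples[i] = theta
    return samples

s = metropolis_hastings(lambda t: -0.5 * t * t)   # target: N(0, 1)
```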
- Gibbs SamplingA special case of Metropolis–Hastings: cycle through variables, sampling each from its full conditional \( p(\theta_j \mid \theta_{-j}, D) \). When the conditionals are tractable (conjugate priors, mixture models, LDA), Gibbs is simple, always accepts, and has been the workhorse of Bayesian inference for four decades.
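A minimal Gibbs sketch on a standard bivariate normal with correlation \( \rho \), where each full conditional is the closed form \( \mathcal{N}(\rho \cdot \text{other},\ 1 - \rho^2) \) (the value of \( \rho \) and chain length are toy choices):

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n=20000, seed=0):
    """Gibbs sampling by alternating the two exact full conditionals (sketch)."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    out = np.empty((n, 2))
    sd = np.sqrt(1 - rho ** 2)
    for i in range(n):
        x = rng.normal(rho * y, sd)   # sample x | y  (always accepted)
        y = rng.normal(rho * x, sd)   # sample y | x
        out[i] = (x, y)
    return out

samples = gibbs_bivariate_normal()
emp_rho = float(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])
```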
- Concentration Inequalities (Hoeffding, Bernstein, McDiarmid)High-probability bounds on how far a sum or function of independent random variables can deviate from its mean. Hoeffding uses boundedness, Bernstein exploits known variance for a tighter bound, and McDiarmid handles functions whose value changes little when any single argument changes — the workhorses behind PAC / generalization proofs.
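Hoeffding's inequality is easy to check empirically: for \( n \) fair coin flips, \( P(|\bar{X} - 1/2| \ge t) \le 2e^{-2nt^2} \). A toy simulation (trial counts and \( t \) are arbitrary choices):

```python
import numpy as np

def hoeffding_check(n=100, t=0.1, trials=2000, seed=0):
    """Compare the empirical deviation frequency of n fair coin flips
    against the Hoeffding bound 2*exp(-2*n*t^2) (sketch)."""
    rng = np.random.default_rng(seed)
    flips = rng.random((trials, n)) < 0.5
    deviations = np.abs(flips.mean(axis=1) - 0.5)
    empirical = float(np.mean(deviations >= t))
    bound = float(2 * np.exp(-2 * n * t ** 2))
    return empirical, bound

empirical, bound = hoeffding_check()   # empirical frequency sits below the bound
```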
- PAC-Bayes Generalization BoundsPAC-Bayes bounds the generalization gap of a stochastic classifier \( Q \) by a Kullback–Leibler term against a data-independent prior \( P \): with probability \( 1 - \delta \), \( \mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat R(h)] + O(\sqrt{\text{KL}(Q\|P)/n}) \). Used for non-vacuous bounds on overparameterised networks.
- Hierarchical Clustering & LinkageBuilds a tree (dendrogram) of clusters either bottom-up (agglomerative: start with singletons, merge closest pairs) or top-down (divisive). The linkage criterion — single, complete, average, Ward — defines distance between clusters and dictates the cluster shapes the algorithm prefers.
- Spectral Clustering & the Graph LaplacianConstruct a similarity graph over data points, embed each point into \( \mathbb{R}^k \) using the eigenvectors of the \( k \) smallest eigenvalues of the graph Laplacian, and run k-means in that space. The Laplacian's spectrum encodes cluster structure as low-frequency eigenmodes — works for non-convex clusters where Euclidean k-means fails.
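For two clusters the k-means step can be replaced by the sign of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). A sketch on toy 1-D data with a Gaussian similarity kernel:

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral cut: sign of the Fiedler vector of the unnormalised
    graph Laplacian L = D - W (sketch)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian similarity graph
    L = np.diag(W.sum(1)) - W
    vals, vecs = np.linalg.eigh(L)               # eigh sorts eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)          # second eigenvector's sign

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])   # two obvious clumps
labels = spectral_bipartition(X)
```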
- Non-Negative Matrix Factorization (NMF)Factorise a nonnegative matrix \( V \approx W H \) with \( W, H \ge 0 \) entrywise. The nonnegativity constraint yields parts-based, interpretable components (topic–word, basis–image) and distinguishes NMF from PCA, whose sign-free components are typically holistic and hard to name.
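The classic Lee–Seung multiplicative updates keep \( W, H \) nonnegative by construction, since each update multiplies by a ratio of nonnegative terms. A toy sketch on an exactly rank-2 nonnegative matrix (sizes and iteration count are arbitrary):

```python
import numpy as np

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for V ~ W @ H with W, H >= 0 (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # ratio of nonnegatives: H stays >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))    # rank-2 nonnegative target
W, H = nmf(V, k=2)
err = float(np.linalg.norm(V - W @ H))
```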
- Independent Component Analysis (ICA)ICA separates a linear mixture \( x = A s \) into statistically independent non-Gaussian sources by finding a de-mixing matrix \( W \) that maximises the non-Gaussianity of \( y = W x \). Classical application: the cocktail-party problem. Key distinction from PCA: maximises independence, not variance.
- Kernel PCAPrincipal component analysis in the implicit feature space of a positive-definite kernel \( k(x, y) \). Eigendecomposes the centred Gram matrix \( K \) rather than the data covariance; recovers non-linear principal directions without ever instantiating the feature map. Used for non-linear dimensionality reduction and feature extraction.
- Isomap & Locally Linear Embedding (LLE)Two classical manifold-learning algorithms: Isomap replaces Euclidean distances with geodesic distances on a \( k \)-NN graph and applies MDS; LLE reconstructs each point from its neighbours' linear weights and finds a low-dimensional embedding that preserves those weights. Both set the conceptual stage for t-SNE and UMAP.
- Conditional Random Fields (CRF)A discriminative model of \( p(y \mid x) = \tfrac{1}{Z(x)} \exp \sum_k \lambda_k f_k(y, x) \) over structured outputs. Linear-chain CRFs add sequence-level constraints on top of per-token scores, enabling tractable training via forward–backward and Viterbi decoding — still the backbone of NER/tagging heads above neural encoders.
- Bayesian Networks & Directed Graphical ModelsA Bayesian network is a DAG \( G \) over variables whose joint factorises as \( p(x) = \prod_i p(x_i \mid \text{pa}_G(x_i)) \). D-separation reads conditional independences off the graph; parameter learning uses MLE under complete data, EM under latent variables. The mathematical foundation of structured probabilistic modelling.
- Belief Propagation (Sum–Product Algorithm)A message-passing algorithm that computes exact marginals on tree-structured graphical models in linear time and approximate marginals (loopy BP) on general graphs. Each node sends summarising messages to neighbours; final beliefs equal the product of incoming messages.
- Variational Autoencoder (VAE)A latent-variable generative model trained by maximising the ELBO \( \mathcal{L}(x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\text{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \). The reparameterisation trick makes the encoder \( q_\phi \) differentiable; the decoder \( p_\theta \) learns to reconstruct \( x \) from latent codes \( z \sim \mathcal{N}(0, I) \).
- β-VAE & Disentanglementβ-VAE replaces the ELBO's KL term with a weighted \( \beta \cdot D_{\text{KL}} \). Values \( \beta > 1 \) push the encoder toward an isotropic prior, encouraging each latent dimension to capture one independent factor of variation — the original disentanglement recipe.
- Normalizing Flows (RealNVP, Glow)Invertible neural networks \( f_\theta: \mathbb{R}^d \to \mathbb{R}^d \) with tractable Jacobian determinant. The change-of-variables formula \( \log p_X(x) = \log p_Z(f(x)) + \log |\det J_f(x)| \) gives exact likelihood; sampling runs \( f^{-1} \). RealNVP and Glow use coupling layers to make both directions \( O(d) \) per step.
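The change-of-variables formula can be verified exactly on the simplest possible flow, a 1-D affine map to a standard normal base — the flow log-likelihood must equal the direct \( \mathcal{N}(\mu, s^2) \) log-density (all numbers below are toy values):

```python
import numpy as np

def affine_flow_logpdf(x, mu, s):
    """log p_X(x) through the flow f(x) = (x - mu)/s with standard-normal base:
    change of variables adds log|f'(x)| = -log s (sketch)."""
    z = (x - mu) / s
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    return log_base + np.log(1.0 / s)

x, mu, s = 1.3, 0.5, 2.0
lp_flow = float(affine_flow_logpdf(x, mu, s))
# direct N(mu, s^2) log-density for comparison
lp_direct = float(-0.5 * ((x - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi))
```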
- Autoregressive Flows (MAF & IAF)Flows in which the \( i \)-th output depends only on previous inputs \( x_{<i} \), giving a triangular Jacobian. MAF (masked autoregressive flow) has fast density evaluation but slow sampling; IAF (inverse autoregressive flow) is the mirror image — fast sampling, slow density. Both are cornerstones of modern density estimation.
- Energy-Based Models (EBM)A generative model \( p_\theta(x) = \exp(-E_\theta(x))/Z(\theta) \) defined by a scalar energy \( E_\theta \). The intractable normaliser \( Z(\theta) = \int e^{-E_\theta(x)} dx \) precludes direct MLE; training sidesteps it with contrastive divergence, score matching, or noise-contrastive estimation.
- Restricted Boltzmann Machines (RBM)A bipartite EBM over visible and hidden binary units with energy \( E(v, h) = -v^\top W h - b^\top v - c^\top h \). Conditional independence within each layer gives closed-form conditionals \( p(h\mid v), p(v\mid h) \); Hinton's Contrastive Divergence trains them and the RBM stack forms a deep belief net.
- Noise-Contrastive Estimation (NCE)Learn an unnormalised model \( \tilde p_\theta(x) \) by training a binary classifier to distinguish data samples from noise samples. The classifier's logit becomes \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \), so the partition function is absorbed into a learnable constant. Foundation of word2vec's negative sampling and of InfoNCE contrastive learning.
- Score-Based SDEs (Continuous-Time Diffusion)Song et al. (2021) showed that discrete-time DDPM and noise-conditional score models are both limits of a continuous-time SDE \( dx = f(x,t)dt + g(t)dW \). The unified framework gives a reverse-time SDE and a probability-flow ODE that share marginals, enabling flexible samplers (Euler, Heun, DPM-Solver) and exact likelihoods.
- Neural Ordinary Differential EquationsA neural ODE defines the hidden-state evolution as \( dh/dt = f_\theta(h, t) \), integrated by a black-box ODE solver. Training uses the adjoint method to back-propagate at constant memory regardless of solver depth. Connects residual networks to continuous flows and underlies continuous normalising flows and flow matching.
- Set Transformer & Deep Sets (Permutation Invariance)Deep Sets: any permutation-invariant function on sets equals \( \rho(\sum_i \phi(x_i)) \) for learnable \( \phi, \rho \). Set Transformer replaces the sum with self-attention via Induced Set Attention Blocks, giving element-wise interactions while remaining permutation-equivariant.
- Sharpness-Aware Minimization (SAM)Minimise a loss whose value is worst-case over a small \( \rho \)-ball of weight perturbations: \( \min_\theta \max_{\|\varepsilon\| \le \rho} \mathcal{L}(\theta + \varepsilon) \). The ascent step \( \varepsilon^\star \approx \rho \, \nabla \mathcal{L}/\|\nabla \mathcal{L}\| \) biases training toward flat minima, improving generalisation across ViT, ResNet, and LLM finetuning.
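The two-step structure of a SAM update — ascend to the approximate worst-case point in the \( \rho \)-ball, then descend using the gradient evaluated there — is a short sketch (the quadratic loss and hyperparameters are toy choices):

```python
import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One SAM update (sketch): first-order worst-case perturbation,
    then a descent step with the gradient at the perturbed point."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent direction, norm rho
    g_sharp = grad_fn(theta + eps)                 # gradient at worst-case point
    return theta - lr * g_sharp

grad = lambda t: 2.0 * t                # gradient of the toy loss L(t) = t^2
theta_new = sam_step(np.array([1.0]), grad, lr=0.1, rho=0.05)
# eps = 0.05, so g_sharp = 2 * 1.05 = 2.1 and theta_new = 1 - 0.21 = 0.79
```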
- Shampoo & K-FAC PreconditionersShampoo and K-FAC are second-order-inspired optimizers that precondition gradients with matrix or blockwise curvature information instead of only per-parameter learning rates. They aim to converge in fewer steps than Adam or SGD, especially in large-batch training where curvature estimates are more stable.
- Meta-Learning: MAML & ReptileMAML learns an initialisation \( \theta \) such that one or a few SGD steps on a new task yield good performance — formally \( \min_\theta \sum_\tau \mathcal{L}_\tau(\theta - \alpha \nabla \mathcal{L}_\tau(\theta)) \). Reptile is a first-order simplification that moves \( \theta \) toward per-task adapted parameters. Influential early 'learn to learn' recipes, since absorbed into prompt-based few-shot learning.
- InfoNCE & NT-Xent Contrastive LossesInfoNCE maximises a mutual-information lower bound by classifying a positive pair against \( k \) negatives: \( \mathcal{L} = -\log \exp(s^+) / \sum_i \exp(s_i) \). NT-Xent is InfoNCE with temperature-scaled cosine similarities. Drives SimCLR, MoCo, CLIP, and most modern self-supervised representation learning.
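For a single anchor the InfoNCE loss is just a cross-entropy over one positive and \( k \) negative similarities; a tiny sketch with made-up similarity scores shows the loss shrinking as the positive pulls ahead:

```python
import numpy as np

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE for one anchor: softmax cross-entropy of the positive
    against k negatives with temperature-scaled similarities (sketch)."""
    logits = np.concatenate(([sim_pos], sim_negs)) / temperature
    logits -= logits.max()                       # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

negs = np.array([0.1, 0.0, -0.2])
loss_good = info_nce(0.9, negs)   # well-separated positive: small loss
loss_bad = info_nce(0.2, negs)    # positive barely above negatives: larger loss
```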
- Platt Scaling & Isotonic RegressionPlatt scaling calibrates a classifier by fitting a one-dimensional logistic regression from raw scores to probabilities on a validation set. Isotonic regression is a more flexible monotonic alternative that can fit non-sigmoid calibration curves, but it usually needs more calibration data to avoid overfitting.
- Bayesian Deep Learning: MC Dropout & Deep EnsemblesMC dropout estimates predictive uncertainty by keeping dropout active at test time and averaging many stochastic forward passes. Deep ensembles train several independently initialized models and usually give stronger uncertainty estimates, at higher training and serving cost.
- Conformal PredictionA distribution-free procedure for turning any point predictor into a prediction set with guaranteed finite-sample coverage \( 1 - \alpha \) under exchangeability. Requires only a scoring function and a calibration set; no assumption on the underlying model or data distribution.
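Split conformal prediction in its simplest regression form: take the \( (1-\alpha) \) quantile of calibration scores with the finite-sample \( (n+1) \) correction, and intervals of that radius cover exchangeable test points at rate \( \ge 1 - \alpha \). A sketch on synthetic residuals (the Gaussian scores stand in for a real model's absolute errors):

```python
import numpy as np

def split_conformal_interval(cal_scores, alpha=0.1):
    """Conformal quantile with the (n+1)/n finite-sample correction (sketch)."""
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q_level, 1.0), method="higher"))

rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(size=500))     # |residuals| on a calibration split
q = split_conformal_interval(cal_scores, alpha=0.1)
test_scores = np.abs(rng.normal(size=5000))   # exchangeable test residuals
coverage = float(np.mean(test_scores <= q))   # close to 1 - alpha = 0.9
```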
- Paired Bootstrap & McNemar's Test for Model ComparisonTwo non-parametric procedures for deciding whether model A beats model B with statistical significance. Paired bootstrap resamples matched predictions to estimate a confidence interval on the metric difference. McNemar's test uses a chi-squared on the \( 2\times 2 \) contingency table of agreement / disagreement.
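The paired-bootstrap half can be sketched directly: resample example indices with replacement, recompute the accuracy difference on each resample, and read off a percentile interval. The per-example correctness vectors below are synthetic stand-ins for two models' matched predictions:

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=5000, seed=0):
    """95% percentile bootstrap CI on the accuracy difference, resampling
    the same example indices for both models (sketch)."""
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    n = len(correct_a)
    idx = rng.integers(0, n, (n_boot, n))               # resampled example indices
    diffs = correct_a[idx].mean(1) - correct_b[idx].mean(1)
    return float(np.quantile(diffs, 0.025)), float(np.quantile(diffs, 0.975))

rng = np.random.default_rng(1)
a = rng.random(300) < 0.80      # synthetic model A: ~80% accuracy
b = rng.random(300) < 0.70      # synthetic model B: ~70% accuracy
lo, hi = paired_bootstrap_ci(a, b)   # interval on the A - B accuracy gap
```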
- Langevin Dynamics & MALALangevin MCMC treats gradient noise as deliberate: proposals drift along \( -\nabla \log \pi(\theta) \) plus Gaussian perturbation so the chain targets \( \pi \). Metropolis-adjusted Langevin (MALA) corrects discretisation bias with an MH acceptance; unadjusted Langevin (ULA) trades bias for simplicity and scales to big data via stochastic gradients (SGLD).
- PixelCNN / PixelCNN++Autoregressive image models that factor \( p(x) = \prod_i p(x_i \mid x_{1:i-1}) \) with masked convolutions so each pixel sees only pixels above and to the left. Tractable likelihood and sharp samples; Gated PixelCNN adds gated activations and horizontal/vertical stacks, and PixelCNN++ further improves the likelihood with a discretised logistic mixture output, downsampling, and shortcut connections.
- IPO (Identity Preference Optimization)Replaces DPO's sigmoid objective with a squared-error criterion on preference probabilities: \( \mathcal{L}_{\text{IPO}} = \mathbb{E}[(h_\theta(y_w, y_l) - 1/(2\beta))^2] \). Prevents DPO's tendency to over-separate preferred and rejected responses on easy pairs, reducing overfitting and improving generalisation.
- Transcoders & Sparse CrosscodersTranscoders and sparse crosscoders are interpretability models that learn sparse dictionaries linking features across layers rather than explaining one layer in isolation. They are used to trace how a concept is transformed, preserved, or split as it moves through a network.
- Causal Scrubbing & Mediation AnalysisRigorous protocols for validating interpretability hypotheses. Causal scrubbing replaces the hypothesised-irrelevant computations with samples from a distribution that should preserve the output; mediation analysis tests whether a candidate component mediates the causal effect of an input on an output. Tools for turning 'this feature looks meaningful' into falsifiable claims.
- Circuit Discovery Pipelines (ACDC, Attribution Patching)Automated methods to locate the minimal sub-graph of attention heads and MLP components responsible for a given behaviour. ACDC greedily ablates edges in a causal graph, keeping only those whose removal degrades the behaviour; attribution patching approximates this with a single forward-backward pass per hypothesis.
- Representation Engineering & the Refusal DirectionShift model behaviour by directly manipulating residual-stream activations along interpretable directions. The 'refusal direction' (Arditi et al. 2024) is a single direction in activation space whose ablation jailbreaks open-weight chat models, and whose injection forces refusal — evidence that safety training installs a shallow, targeted feature.
- Concept Erasure & Null-Space ProjectionRemove a protected concept (gender, ethnicity, refusal, a specific memory) from representations by iteratively projecting activations onto the null space of linear classifiers for that concept. Achieves provable linear guardedness — no linear classifier can recover the erased attribute — with bounded utility loss.
- Differential Privacy & DP-SGDFormal guarantee that an algorithm's output distribution barely changes if any single training example is replaced. DP-SGD achieves \( (\varepsilon, \delta) \)-DP by clipping per-example gradients and adding calibrated Gaussian noise. Central to privacy-preserving ML training on sensitive data.
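The DP-SGD aggregation step is mechanically simple: clip each per-example gradient, sum, add Gaussian noise calibrated to the clip norm, and average. A sketch with toy gradient vectors (the clip norm and noise multiplier are illustrative, not privacy-accounted values):

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm=1.0, noise_mult=1.1, seed=0):
    """One DP-SGD aggregation (sketch): per-example L2 clipping, then
    Gaussian noise with std noise_mult * clip_norm on the sum."""
    rng = np.random.default_rng(seed)
    g = np.asarray(per_example_grads, float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g_clipped = g * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = g_clipped.sum(0) + rng.normal(0, noise_mult * clip_norm, g.shape[1])
    return noisy_sum / len(g)

grads = np.array([[3.0, 4.0], [0.3, 0.4]])   # norms 5.0 (clipped) and 0.5 (kept)
g_private = dp_sgd_gradient(grads)
```

With the noise turned off, the first gradient is scaled to norm 1 (→ [0.6, 0.8]) while the second passes through unchanged, so the average is [0.45, 0.6].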
- Membership Inference Attacks (MIA)Determine whether a specific example was in the training set of a deployed model. Attacks exploit loss / confidence gaps between seen and unseen examples — trained models are typically more confident on memorised training points. A baseline for privacy leakage in ML systems.
- CTC Loss & RNN-TransducerTwo objectives for training sequence-to-sequence models when alignment between input and output frames is unknown. CTC sums over all alignment paths with blank symbols; RNN-T factorises the model into a prediction network and a joint network so output length is modelled independently of the input frame rate. Backbones of modern ASR pipelines.
- Mean Field Theory of Neural NetworksMean-field theory studies very wide neural networks by tracking distributions of parameters or activations instead of individual weights. It yields clean scaling limits for training dynamics and feature learning, and helps distinguish true feature-learning regimes from the lazy-training NTK regime.
- Information Bottleneck TheoryInformation Bottleneck theory studies representations that preserve information about the target while compressing information about the input, often through a trade-off like \( I(Z;Y) - \beta I(Z;X) \). It is a useful lens on representation learning and generalization, though its direct explanatory power for deep networks remains debated.
- Stability and GeneralizationAn algorithm is uniformly \( \beta \)-stable if replacing one training point changes its output's loss by at most \( \beta \). Bousquet & Elisseeff (2002) proved that \( \beta \)-stability bounds the generalization gap by \( O(\beta + 1/\sqrt{n}) \); Hardt, Recht & Singer (2016) showed SGD on smooth losses is \( O(T/n) \)-stable, giving the first algorithm-dependent generalization bound for deep learning that grows with training time.
- Algorithmic Alignment TheoryA neural architecture generalises better on a reasoning task when its computational structure aligns with the algorithm that solves the task. Xu et al. (2020) formalise sample complexity in terms of the number of network modules that must be learned and the per-module learnability, predicting that GNNs (multi-step message passing) align with dynamic-programming algorithms while plain MLPs do not.
- Spectral Bias of Neural NetworksSpectral bias is the tendency of gradient-trained neural networks to learn low-frequency or smooth components of a target function before high-frequency ones. This helps explain why neural nets often fit coarse structure early and fine detail later.
- Neural CollapseAt the terminal phase of training (TPT) — long after zero training error — the last-layer features and classifier weights of a deep classifier converge to a highly symmetric configuration: per-class feature means form a Simplex Equiangular Tight Frame (ETF), within-class variability collapses to zero, classifier weights align with the class means, and prediction reduces to nearest-class-centre. Papyan, Han & Donoho (2020) established this as a robust empirical phenomenon across architectures and datasets.
- Mode Connectivity in Loss LandscapesMode connectivity is the empirical finding that independently trained solutions can often be connected by a low-loss path in parameter space. This suggests that many minima in deep learning are not isolated basins but parts of wider connected regions.
- Loss Landscape VisualizationMethods for visualising high-dimensional loss surfaces by projecting parameters onto 1-D or 2-D subspaces. Goodfellow's linear interpolation (2014) plots loss along the line between two solutions; Li et al.'s filter normalisation (2018) plots loss in a 2-D plane spanned by random Gaussian directions normalised per-filter. The latter reveals that residual connections smooth the landscape and that flat minima correspond to wide bowls in the visualisation.
- Contrastive Learning TheoryThe theoretical account of why contrastive self-supervised objectives like InfoNCE produce useful representations. Wang & Isola (2020) show the InfoNCE loss decomposes into two asymptotic terms — \( \mathcal{L}_{\text{align}} \), pulling positive pairs together, and \( \mathcal{L}_{\text{unif}} \), spreading the marginal feature distribution uniformly on the hypersphere. The downstream linear-probe accuracy correlates almost perfectly with this alignment-uniformity trade-off, giving a geometric explanation for why contrastive learning works at all.
- Disentangled Representation LearningDisentangled representation learning seeks latent coordinates that each correspond to separate underlying factors of variation in the data. It is attractive for control and interpretability, but in the unsupervised setting true disentanglement is usually not identifiable without extra inductive bias or supervision.
- Representation CollapseRepresentation collapse is the failure mode where many inputs map to nearly the same embedding or hidden state, destroying useful information. It appears in several forms — constant-vector collapse in self-supervision, dimensional collapse where only a few directions survive, and cluster collapse in discrete latents — and each requires a different fix.
- Invariance vs EquivarianceA representation is invariant to a transformation if the output does not change when the input is transformed, and equivariant if the output changes in a predictable transformed way. CNN translation equivariance and classifier translation invariance are the canonical example pair.
- Embedding GeometryEmbedding geometry studies what information is encoded in the distances, angles, and directions of an embedding space. Properties like similarity, analogy structure, anisotropy, and clustering determine how useful an embedding is for retrieval and downstream tasks.
- Representation Alignment Across ModalitiesRepresentation alignment across modalities trains different encoders so paired inputs, such as an image and its caption, land near each other in a shared embedding space. This makes cross-modal retrieval and transfer possible by giving different modalities a common geometry.
- Tokenization as RepresentationTokenization is not just preprocessing: it decides which units the model can represent directly and therefore shapes the statistics the model learns. The choice of characters, subwords, bytes, or domain-specific tokens changes sequence length, vocabulary size, inductive bias, and how cleanly concepts map into embeddings.
- Autoregressive vs Diffusion TradeoffsAutoregressive models factorise \( p(x) = \prod_t p(x_t \mid x_{<t}) \) and dominate text generation; diffusion models learn a denoising process and dominate continuous-modality generation. The two paradigms differ in likelihood tractability, sampling cost, controllability, and compositionality — and the right choice depends on whether tokens are discrete, parallel decoding is required, and whether log-likelihood or perceptual quality is the figure of merit.
- In-Context Learning MechanismsIn-context learning (ICL) is the empirical phenomenon that a frozen LLM solves new tasks from few-shot examples in the prompt. Mechanistic studies show ICL is implemented by a small set of attention circuits — induction heads, function vectors, and implicit gradient-descent-like updates — that emerge during pretraining once the data and depth budget cross a threshold.
- Alignment Techniques (RLHF, DPO, RLAIF, comparison)Modern LLM alignment uses preference data to adjust a pretrained model so it follows instructions, refuses unsafe content, and ranks desired behaviours above undesired ones. The dominant recipes — RLHF with PPO, DPO and its variants, and RLAIF with AI-generated preferences — share the same Bradley–Terry preference model but differ in optimiser, reward-model dependence, and stability.
- Offline Reinforcement LearningOffline reinforcement learning learns a policy from a fixed logged dataset without further interaction with the environment. Its central difficulty is distribution shift: Bellman backups evaluate actions that are poorly supported by the data, so modern methods either constrain the learned policy to stay near the behavior data or pessimistically down-value unsupported actions.
- Model-Based Reinforcement LearningModel-based reinforcement learning learns an explicit or latent model of environment dynamics and uses that model for planning, imagination rollouts, or policy optimization. Its main advantage is sample efficiency, while its main failure mode is model bias: the policy can exploit errors in the learned simulator unless planning and training control compounding prediction error.
- Decision TransformersDecision Transformers cast offline reinforcement learning as conditional sequence modeling: a Transformer predicts the next action from past returns-to-go, states, and actions. This avoids explicit Bellman backups and instead treats policy learning like autoregressive imitation conditioned on the desired future return.
- Multi-Agent Reinforcement LearningMulti-agent reinforcement learning studies environments where several learning agents interact simultaneously, making each agent's dynamics depend on the evolving policies of the others. The main challenges are non-stationarity, coordination, and credit assignment, which is why centralized training with decentralized execution is a common modern design.
- Exploration vs Exploitation (Deep RL View)Exploration versus exploitation is the trade-off between taking actions that seem best under current knowledge and taking actions that improve knowledge of the environment. In deep RL the problem is harder than in bandits because rewards can be sparse, state spaces are large, and short-term randomness is often not enough to discover long-horizon strategies.
- Reward HackingReward hacking occurs when an agent maximizes the formal reward signal while failing at the designer's intended objective. It is a general Goodhart-style failure mode in reinforcement learning: stronger optimization pressure often finds loopholes in the proxy reward faster than humans can patch them.
- Safe Reinforcement LearningSafe reinforcement learning studies how to optimize long-term return while satisfying safety constraints during training and deployment. The standard formalism is a constrained Markov decision process, where the policy must maximize reward subject to a bound on expected cost, risk, or unsafe-state visitation.
- Hierarchical Reinforcement LearningHierarchical reinforcement learning decomposes control across time scales, usually by letting a high-level policy choose skills, options, or subgoals and a low-level policy execute them. This can make sparse-reward and long-horizon problems easier, but only if the learned hierarchy discovers reusable abstractions rather than collapsing back to flat control.
- Data-Centric AIData-centric AI treats data quality, labeling, coverage, and feedback loops as first-class levers of model performance rather than focusing only on architecture or hyperparameters. The core workflow is to diagnose failure slices, improve the dataset that generates those failures, and measure whether the model gets better for the right reasons.
- Adversarial Robustness (Modern Attacks)Adversarial robustness studies how learned models can be forced into wrong predictions by carefully chosen perturbations that are small, structured, or semantically deceptive. Modern attack families include gradient-based \(\ell_p\) attacks, universal perturbations, adversarial patches, and transfer attacks; defenses must avoid merely hiding gradients while leaving the model fragile.
- Distribution Shift & Dataset ShiftDistribution shift occurs when the joint distribution seen at deployment differs from the one used for training or validation. The main cases are covariate shift, label shift, and concept shift; each breaks generalization in a different way and therefore requires different detection and mitigation strategies.
- Uncertainty Estimation in Deep LearningUncertainty estimation in deep learning tries to quantify when a model should be unsure, not just what label it predicts. The key distinction is between aleatoric uncertainty from irreducible noise in the data and epistemic uncertainty from limited knowledge of the model, and modern methods such as deep ensembles, MC dropout, and conformal prediction target those uncertainties differently.
- RLHF as KL-Regularized Policy OptimizationA deeper theoretical view of RLHF treats post-training as optimizing a policy against a learned reward while regularizing toward a reference model with a KL penalty. This viewpoint explains why PPO-RLHF, reward-model training, and even DPO-style objectives are closely related: they are different ways of solving or approximating the same regularized preference-optimization problem.
- Law of Total ProbabilityThe law of total probability computes an event probability by summing over mutually exclusive, exhaustive cases. In machine learning it is the basic marginalization identity behind latent-variable models, mixture models, and many Bayesian calculations.
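As a worked one-liner, the marginalization identity \( P(X=x) = \sum_z P(X=x \mid Z=z)P(Z=z) \) is exactly how a mixture model computes its marginal (toy numbers below):

```python
# Law of total probability over a two-component partition (toy sketch):
# P(X = 1) = sum_z P(X = 1 | Z = z) * P(Z = z)
p_z = [0.3, 0.7]              # mixture weights (exhaustive, mutually exclusive)
p_x_given_z = [0.9, 0.2]      # P(X = 1 | Z = z) for each component
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))   # 0.27 + 0.14 = 0.41
```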
- Conditional IndependenceConditional independence means two variables become unrelated once a third variable is known. It is the simplifying assumption that makes graphical models tractable and explains why conditioning can either remove dependence or, in collider structures, create it.
- Sufficient StatisticsA sufficient statistic is a summary of the sample that retains all information about a parameter relevant for inference. This is why many classical models can replace an entire dataset with counts, sums, or means without changing the likelihood-based conclusions about the parameter.
- Bayes Risk and the Bayes Optimal ClassifierBayes risk is the minimum achievable expected loss under the true data distribution, and the Bayes-optimal classifier attains it by minimizing posterior expected loss for each input. Under ordinary 0–1 loss, that rule becomes “predict the class with highest posterior probability.”
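Under 0–1 loss the Bayes-optimal rule reduces to an argmax over the posterior, and the Bayes risk at an input is one minus the largest posterior probability. A toy sketch with an assumed posterior table:

```python
# Bayes-optimal prediction under 0-1 loss: argmax_y P(y | x).
def bayes_predict(posterior):
    """posterior: dict mapping class label -> P(y | x)."""
    return max(posterior, key=posterior.get)

# Hypothetical posterior for a single input x.
post = {"cat": 0.2, "dog": 0.7, "bird": 0.1}
pred = bayes_predict(post)
risk = 1 - max(post.values())  # pointwise Bayes risk under 0-1 loss
print(pred, risk)  # dog 0.3
```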
- Proper Scoring RulesA scoring rule is proper if a forecaster minimizes expected score by reporting their true predictive distribution. Proper scoring rules matter because they reward honest, calibrated probabilities rather than merely getting the top-ranked class right.
- Surrogate Losses and Classification CalibrationSurrogate losses replace hard-to-optimize 0–1 classification loss with tractable objectives such as logistic or hinge loss. A surrogate is classification-calibrated if optimizing it still drives the classifier toward the Bayes-optimal decision rule.
- Empirical Risk Minimization (ERM)Empirical risk minimization chooses the model with the smallest average training loss. It is the default principle behind most supervised learning, but it must be paired with capacity control or held-out evaluation because low training loss alone does not guarantee generalization.
- Structural Risk Minimization (SRM)Structural risk minimization extends empirical risk minimization by balancing training fit against model complexity. It is the learning-theoretic principle behind regularization, margin control, and choosing among hypothesis classes of different capacity.
- Confusion MatrixA confusion matrix counts predicted labels against true labels. In binary classification it yields the four basic counts—true positives, false positives, true negatives, and false negatives—from which most common thresholded metrics are derived.
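The four cells, and the thresholded metrics built from them, can be sketched directly from paired label lists (toy labels assumed):

```python
# Count the four binary-classification cells from true/predicted labels.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of predicted positives, how many are real
recall = tp / (tp + fn)     # of real positives, how many were found
```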
- Precision-Recall Curve and Average PrecisionA precision-recall curve shows how precision and recall trade off as the decision threshold moves through a ranked list of predictions. Average precision summarizes that curve and is especially informative when the positive class is rare.
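One common form of average precision is the mean of precision@k taken at each rank where a true positive appears; a short sketch on an assumed ranked label list:

```python
# Average precision over a ranked list (1 = relevant, 0 = not):
# mean of precision@k at each rank k holding a true positive.
def average_precision(ranked_labels):
    hits, total, ap = 0, sum(ranked_labels), 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            ap += hits / k  # precision@k at this relevant item
    return ap / total

# Positives at ranks 1, 3, 6 -> precisions 1, 2/3, 1/2.
ap = average_precision([1, 0, 1, 0, 0, 1])
print(ap)
```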
- Hinge LossHinge loss penalizes examples that are misclassified or that lie too close to the decision boundary. It is the convex margin-based loss underlying soft-margin SVMs and emphasizes confident separation rather than calibrated probabilities.
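With labels \(y \in \{-1, +1\}\) and a real-valued score \(f(x)\), the hinge loss is \(\max(0, 1 - y f(x))\); a few illustrative evaluations:

```python
# Hinge loss: zero only when the example is on the correct side
# of the boundary with margin at least 1.
def hinge(y, score):
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))  # 0.0: correct, beyond the margin
print(hinge(+1, 0.3))  # 0.7: correct but inside the margin
print(hinge(-1, 0.3))  # 1.3: misclassified
```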
- Cost-Sensitive LearningCost-sensitive learning assigns different penalties to different kinds of mistakes instead of treating every error equally. It is the right framework when the real objective is to minimize downstream harm or utility loss rather than raw misclassification rate.
- Markov Decision Process (MDP)A Markov decision process formalizes sequential decision-making with states, actions, transitions, rewards, and a discount factor. Its key assumption is that the next-state and reward distribution depends only on the current state and action, which makes Bellman-style planning possible.
- Dynamic Programming for RLDynamic programming solves an MDP with a known model by repeatedly applying Bellman updates until values or policies become self-consistent. Policy evaluation, policy improvement, policy iteration, and value iteration are the core algorithms in that family.
- Potential Outcomes FrameworkThe potential outcomes framework defines causal effects by comparing the outcomes a unit would have under different treatments. Because only one of those potential outcomes is observed for any given unit, causal inference is fundamentally about identifying missing counterfactuals under defensible assumptions.
- Confounding, Colliders, and Simpson’s ParadoxConfounders create misleading associations because they affect both treatment and outcome, while colliders create bias when you condition on them. Simpson’s paradox is the visible symptom that aggregate and stratified associations can reverse direction when the underlying causal structure is ignored.
- Attention Is All You Need“Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.
- AlexNetAlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe—ImageNet-scale data, GPU training, ReLU, dropout, and augmentation—not from a single isolated trick.
- Multiple Hypothesis Testing and False Discovery RateMultiple hypothesis testing asks how to control false positives when many tests are run at once. False discovery rate control, especially the Benjamini–Hochberg procedure, limits the expected fraction of rejected hypotheses that are actually null and is usually less conservative than family-wise error control.
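The Benjamini–Hochberg step-up rule rejects the k* smallest p-values, where k* is the largest rank k with \(p_{(k)} \le \frac{k}{m}\alpha\). A minimal sketch (not a substitute for a stats library):

```python
# Benjamini-Hochberg FDR control at level alpha over m p-values.
def benjamini_hochberg(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_star = rank  # largest rank passing the step-up threshold
    rejected = set(order[:k_star])
    return [i in rejected for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
decisions = benjamini_hochberg(pvals, alpha=0.05)
print(decisions)  # only the two smallest p-values are rejected here
```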
- Likelihood Ratio TestsA likelihood ratio test compares how well two nested statistical models explain the same data by taking the ratio of their maximized likelihoods. Large likelihood-ratio statistics indicate that the larger model fits substantially better than the restricted one, and under regularity conditions the test statistic is asymptotically chi-squared.
- Importance SamplingImportance sampling estimates an expectation under a target distribution by drawing samples from a different proposal distribution and reweighting them. It is powerful when the proposal places more mass in the important regions of the integrand, but unstable weights can make the variance explode.
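A minimal Monte Carlo sketch: estimate \(\mathbb{E}_p[X^2] = 1\) for a standard normal target \(p\) using samples from a shifted proposal \(q = \mathcal{N}(1,1)\), reweighted by \(p(x)/q(x)\) (all distributions here are assumed for illustration):

```python
# Importance sampling: E_p[f(X)] ~ (1/n) sum_i w_i f(x_i), x_i ~ q,
# with importance weights w_i = p(x_i) / q(x_i).
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
f = lambda x: x ** 2  # E_p[X^2] = 1 for p = N(0, 1)
n = 200_000
est = 0.0
for _ in range(n):
    x = random.gauss(1.0, 1.0)                       # sample from proposal q
    w = normal_pdf(x, 0, 1) / normal_pdf(x, 1, 1)    # importance weight p/q
    est += w * f(x)
est /= n
print(est)  # should land close to 1.0
```

If the proposal were shifted much further from the target, the weights would become heavy-tailed and the estimate's variance would blow up, which is the instability the entry warns about.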
- Bootstrap Confidence IntervalsBootstrap confidence intervals estimate uncertainty by resampling the observed dataset with replacement and recomputing the statistic many times. They are useful when analytic standard errors are awkward, but they inherit the sample's biases and can fail when the original sample is too small or unrepresentative.
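A percentile-bootstrap sketch for the mean, on made-up data (this is the simplest bootstrap interval; bias-corrected variants exist):

```python
# Percentile bootstrap: resample with replacement, recompute the
# statistic many times, take empirical quantiles of the replicates.
import random

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean)
print(lo, hi)  # an interval bracketing the sample mean of 2.3125
```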
- Brier ScoreThe Brier score measures the mean squared error of probabilistic predictions, so it rewards both correctness and calibration. Lower is better, and unlike accuracy it penalizes a confidently wrong 0.99 prediction much more than a cautious 0.6 prediction.
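The asymmetry between confident and cautious mistakes falls directly out of the squared-error form; a two-line sketch:

```python
# Brier score for binary outcomes: mean of (p - y)^2, lower is better.
def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A confidently wrong forecast costs far more than a cautious one.
print(brier([0.99], [0]))  # 0.9801
print(brier([0.6], [0]))   # 0.36
```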
- One-Class SVMA one-class SVM learns a boundary around mostly normal data by separating mapped training points from the origin with maximum margin in feature space. It is a classic novelty-detection method because it needs examples of the inlier class but not labeled anomalies.
- Anomaly DetectionAnomaly detection identifies observations that look unlikely under the pattern of normal data. The main families are density-based methods, reconstruction-based methods, and one-class classification methods, and the right choice depends on whether you have labels, strong feature engineering, or only normal examples.
- Latent Dirichlet Allocation (LDA Topic Models)Latent Dirichlet Allocation models each document as a mixture of latent topics and each topic as a distribution over words. It is a generative model for uncovering coarse semantic structure in bag-of-words corpora, not a modern contextual language model.
- Factor AnalysisFactor analysis models observed variables as linear combinations of a small number of latent factors plus variable-specific noise. It is useful when the goal is to explain covariance structure rather than merely reduce dimension, which is the key difference from PCA.
- Kalman FilterThe Kalman filter recursively estimates the hidden state of a linear Gaussian dynamical system by alternating a prediction step with a measurement update. It is optimal for that model class because the posterior remains Gaussian and is fully described by a mean and covariance.
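The predict/update recursion can be sketched in one dimension for a random-walk state observed with noise (the process variance q, measurement variance r, and prior are all assumed toy values):

```python
# 1-D Kalman filter: state x_t = x_{t-1} + w with Var(w) = q,
# observation z_t = x_t + v with Var(v) = r.
def kalman_1d(zs, q=0.01, r=1.0, x0=0.0, p0=1.0):
    x, p = x0, p0
    estimates = []
    for z in zs:
        # Predict: state mean unchanged, uncertainty grows by process noise.
        p = p + q
        # Update: blend prediction and measurement via the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

zs = [1.2, 0.8, 1.1, 0.9, 1.0]
est = kalman_1d(zs)
print(est)  # estimates climb from the prior at 0 toward the noisy level ~1
```

Because the model is linear-Gaussian, the scalar pair (x, p) is the entire posterior at every step, which is exactly the optimality claim in the entry.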
- Particle FilterA particle filter approximates the posterior over a hidden state with weighted samples, or particles, instead of a single Gaussian. It is useful for nonlinear or non-Gaussian state-space models, but resampling and weight degeneracy are central practical issues.
- Canonical Correlation Analysis (CCA)Canonical correlation analysis finds linear combinations of two random vectors that are maximally correlated with each other. It is the right tool when the question is about shared structure between two views of the same examples rather than variance within a single view.
- Missing Data and ImputationMissing-data methods try to preserve inference when some values are unobserved by modeling why data are missing and how to fill or integrate over the missing entries. The key distinction is between MCAR, MAR, and MNAR, because imputation is far safer when missingness can be treated as conditionally ignorable.
- Value IterationValue iteration solves a known Markov decision process by repeatedly applying the Bellman optimality backup until the value function converges. Once the optimal value is approximated, a greedy policy with respect to that value is optimal or near-optimal.
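The Bellman optimality backup \(V(s) \leftarrow \max_a \sum_{s'} P(s'\mid s,a)\,[R + \gamma V(s')]\) can be sketched on a tiny assumed two-state MDP:

```python
# Value iteration on a toy 2-state MDP.
# mdp[state][action] = list of (prob, next_state, reward) outcomes.
mdp = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in mdp}
for _ in range(200):  # enough sweeps to converge at gamma = 0.9
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in mdp.items()
    }
print(V)  # ~{0: 19.0, 1: 20.0}: go to state 1, then stay forever
```

The greedy policy read off these values (go from state 0, stay in state 1) is optimal, matching the entry's claim that greedy-on-the-converged-values suffices.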
- Policy IterationPolicy iteration alternates between evaluating the current policy and improving it by acting greedily with respect to that value function. It often converges in fewer outer loops than value iteration because each improvement step uses a more fully solved subproblem.
- Monte Carlo Reinforcement LearningMonte Carlo reinforcement learning estimates values from complete sampled returns rather than from one-step bootstrapped targets. That makes the targets unbiased with respect to the episode return, but typically of higher variance than temporal-difference targets.
- Credit Assignment ProblemThe credit assignment problem is the problem of determining which earlier actions, states, or internal computations deserve blame or credit for a later outcome. It is hard because rewards and losses are often delayed, sparse, or distributed across many interacting decisions.
- Instrumental VariablesInstrumental variables identify causal effects when treatment is confounded, provided an instrument affects treatment, is as-good-as random with respect to unobserved confounders, and influences the outcome only through treatment. In simple linear settings, the IV estimand is the ratio of the instrument-outcome covariance to the instrument-treatment covariance.
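The covariance-ratio estimand can be checked on synthetic data with a deliberately hidden confounder (the coefficients and noise scales below are all made up for illustration):

```python
# Simple linear IV: beta = Cov(Z, Y) / Cov(Z, X).
import random

random.seed(0)
beta_true = 2.0
zs, xs, ys = [], [], []
for _ in range(100_000):
    z = random.gauss(0, 1)                             # instrument
    u = random.gauss(0, 1)                             # unobserved confounder
    x = 0.8 * z + u + random.gauss(0, 1)               # confounded treatment
    y = beta_true * x + 1.5 * u + random.gauss(0, 1)   # outcome
    zs.append(z); xs.append(x); ys.append(y)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

beta_iv = cov(zs, ys) / cov(zs, xs)
print(beta_iv)  # near 2.0, despite the confounding through u
```

The naive regression slope cov(xs, ys) / cov(xs, xs) would be biased upward here, because u pushes x and y in the same direction; the instrument recovers the true coefficient because z is independent of u.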
- CounterfactualsA counterfactual asks what would have happened under a different action or treatment than the one that actually occurred. The central difficulty is that for any individual unit, only one potential outcome is observed, so causal inference always requires assumptions to recover the missing alternative.
- Do-CalculusDo-calculus is Pearl's set of graphical transformation rules for turning interventional quantities into observational quantities when the causal graph permits it. It matters because it separates what can be identified from data plus structure from what remains fundamentally unidentifiable.
- Bahdanau AttentionBahdanau attention is the original additive attention mechanism for sequence-to-sequence models, where the decoder scores each encoder state before producing the next token. It solved the fixed-context bottleneck of early seq2seq RNNs by letting the decoder look back over the whole source sequence at every step.
- Seq2Seq with AttentionSeq2seq with attention augments the encoder-decoder architecture so the decoder conditions on a context vector built from all encoder states at each output step. That change made neural machine translation far more effective than fixed-context seq2seq and directly paved the way to modern cross-attention and Transformer models.
- Backpropagation — History (Werbos → Rumelhart/Hinton/Williams)The history of backpropagation is the story of an idea known in pieces before it became a practical neural-network training method. Werbos articulated reverse-mode differentiation for network training in the 1970s, and Rumelhart, Hinton, and Williams turned it into the landmark 1986 demonstration that made multilayer neural networks trainable in practice.