Tag: optimization
92 topic(s)
- Linear regression basicsLinear regression fits an affine function to predict a continuous target by minimizing squared residuals between predictions and observed values. In matrix form, ordinary least squares projects the target vector onto the column space of the design matrix, yielding the normal equations in the full-rank case.
- Xavier/Glorot InitializationXavier or Glorot initialization chooses weight variance from fan-in and fan-out so activations and gradients stay roughly stable across deep layers. It is well suited to symmetric activations such as tanh, while ReLU networks usually prefer He initialization.
- Gradient Checkpointing (Activation Recomputation)Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- Triplet Margin LossTriplet margin loss trains an embedding space so an anchor is closer to a positive example than to a negative example by at least a fixed margin. It is a standard metric-learning objective because it directly enforces relative similarity rather than predicting class labels.
- Lagrange MultipliersLagrange multipliers solve constrained optimization problems by introducing auxiliary variables that encode the constraints inside a single objective. At a constrained optimum, the gradient of the objective lies in the span of the constraint gradients, which is why the method is central to duality and SVM derivations.
- Ordinary Least Squares (OLS) Closed-Form SolutionThe OLS closed-form solution is the exact least-squares answer computed directly from the design matrix rather than by iterative optimization. In the full-rank case it solves the normal equations, and geometrically it projects the target vector onto the column space of the features.
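A minimal numpy sketch of the closed form on synthetic data (all names and sizes below are illustrative). It solves the normal equations directly and then checks against the numerically safer least-squares routine.

```python
import numpy as np

# OLS on synthetic data (shapes and coefficients are made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # design matrix: n samples x d features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal equations: (X^T X) w = X^T y, valid when X has full column rank.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice a QR/SVD-based solver is preferred for numerical stability.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
print(w_lstsq)
```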
- Convex FunctionA convex function is one whose value along any line segment lies on or below the chord connecting that segment's endpoints. This matters in optimization because convex problems have no spurious local minima: every local minimum is global.
- Markov Chain Monte Carlo (MCMC)Markov Chain Monte Carlo samples from a difficult target distribution by constructing a Markov chain whose stationary distribution matches that target. It is essential in Bayesian inference because it replaces intractable posterior integrals with averages over samples, provided the chain mixes well enough.
- The Fisher Information MatrixThe Fisher Information Matrix measures how sensitive a model's log-likelihood is to changes in its parameters and therefore captures local statistical curvature. It underlies asymptotic variance bounds and natural gradient methods because it defines a geometry tied to the model's predictive distribution.
- Loss functionA loss function maps a model's prediction and the true target to a scalar error signal that training aims to minimize. It defines what the model is optimized for, so changing the loss changes which mistakes are treated as costly.
- Mean squared error (MSE)Mean squared error averages the squared difference between predicted and true values, making large errors count disproportionately more than small ones. For regression it is especially important because minimizing MSE is equivalent to maximum likelihood under Gaussian noise.
- Prediction errorPrediction error is the difference between a model's prediction and the true target for an example. It is the atomic quantity from which losses, residual analysis, and generalization metrics are built.
- Gradient descentGradient descent minimizes a differentiable objective by repeatedly moving parameters in the direction of steepest local decrease, namely the negative gradient. Its step size is set by the learning rate, so convergence depends on both objective geometry and update scale.
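A toy numpy sketch of the update loop on a made-up quadratic objective; the step size is taken from a curvature bound rather than tuned, and nothing here is specific to any library.

```python
import numpy as np

# Gradient descent on the quadratic f(w) = ||Aw - b||^2 (data is synthetic).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)

def grad(w):
    return 2 * A.T @ (A @ w - b)                 # gradient of the squared-error objective

lr = 1.0 / (2 * np.linalg.norm(A, ord=2) ** 2)   # safe step size from the Lipschitz bound
w = np.zeros(5)
for _ in range(500):
    w -= lr * grad(w)                            # step against the gradient

print(np.linalg.norm(grad(w)))                   # gradient norm shrinks as the iterates converge
```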
- Mini-batch gradient descentMini-batch gradient descent estimates the gradient on a small subset of training examples at each update instead of on the full dataset or a single example. It is the practical default in deep learning because it balances hardware efficiency with optimization noise.
- Stochastic gradient descent (SGD)Stochastic gradient descent updates parameters using a gradient estimate from one example or a very small random batch, making each step noisy but cheap. That noise can slow exact convergence yet often helps large models optimize and generalize in practice.
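A small numpy sketch of the mini-batch variant described in the last two entries, run on synthetic linear-regression data; batch size, learning rate, and epoch count are arbitrary choices for illustration.

```python
import numpy as np

# Mini-batch SGD for least-squares regression on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(10)
lr, batch_size = 0.05, 32
for epoch in range(20):
    perm = rng.permutation(len(X))                   # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        g = 2 * Xb.T @ (Xb @ w - yb) / len(idx)      # noisy gradient estimate from the batch
        w -= lr * g

print(np.linalg.norm(w - w_true))                    # small: the noisy steps still converge
```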
- ConvergenceIn optimization, convergence means an algorithm's iterates approach a stable solution or stationary point as updates continue. In practice people often mean that the loss or parameters stop changing much, though true convergence depends on the objective and algorithmic assumptions.
- HyperparameterA hyperparameter is a setting chosen outside the optimization loop, such as learning rate, model width, regularization strength, or batch size. Unlike learned parameters, hyperparameters govern how the model is trained or structured and are usually selected by validation.
- BackpropagationBackpropagation computes gradients of a scalar loss with respect to all network parameters by applying the chain rule backward through the computation graph. It makes deep learning practical because it turns a complicated nested function into reusable local gradient calculations.
- Backward passThe backward pass propagates gradients from the loss back through the computation graph to determine how each parameter affected the final error. It uses stored forward-pass intermediates and the chain rule to accumulate derivatives efficiently.
- Chain ruleThe chain rule gives the derivative of a composition of functions by multiplying local derivatives along the computation path. It is the mathematical principle that backpropagation applies at scale throughout a neural network.
- Automatic differentiationAutomatic differentiation computes exact derivatives of a program by systematically composing derivatives of its primitive operations. Unlike symbolic differentiation it does not manipulate formulas, and unlike numerical differentiation it does not rely on finite-difference approximations.
- Computational graphA computational graph represents a calculation as nodes for variables or operations and edges for data dependencies. It is useful because the same graph that defines the forward computation can also be traversed backward to perform automatic differentiation.
- Learning rateThe learning rate is the scalar that sets how large each optimization step is when parameters are updated. If it is too high training can diverge or oscillate, and if it is too low training can become extremely slow or get stuck.
- Composite functionA composite function applies one function to the output of another, such as f of g of x. Neural networks are composite functions at scale, which is why gradients are computed by repeatedly applying the chain rule.
- Backpropagation through time (BPTT)Backpropagation through time trains a recurrent network by unrolling it across sequence steps and applying backpropagation to the resulting deep computational graph. It exposes how earlier states influence later losses, but long unrolls make optimization and memory use difficult.
- What is the vanishing gradient problem?The vanishing gradient problem is the tendency for gradients propagated through many layers or time steps to shrink exponentially, making early parameters learn extremely slowly. It is especially severe in deep sigmoid or tanh networks and was a main motivation for LSTMs, better initialization, and residual connections.
- Negative log-likelihoodNegative log-likelihood is the loss obtained by negating the log-likelihood, so maximizing probability becomes minimizing a positive objective. It is the standard training loss for probabilistic classifiers, language models, and many generative models.
- Maximum likelihood estimate (MLE)The maximum likelihood estimate selects the parameter values that make the observed data most probable under the model. Many standard ML objectives, including cross-entropy for classification and next-token prediction for LLMs, are just MLE written as minimization of negative log-likelihood.
- Cross-entropyCross-entropy measures the average coding cost of samples from a true distribution \( p \) when encoded using a model distribution \( q \). In ML it is the standard loss for classification and language modeling, and minimizing it is equivalent to maximum likelihood up to an entropy constant.
- Binary cross-entropyBinary cross-entropy is the cross-entropy loss for a Bernoulli target, typically \( -[y\log \hat p + (1-y)\log(1-\hat p)] \). It is the standard loss for binary classification and for multi-label problems where each label is predicted independently.
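A short numpy sketch of the formula above on made-up labels and probabilities; the clipping constant is only there to avoid log(0).

```python
import numpy as np

# Binary cross-entropy on illustrative labels and predicted probabilities.
def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)     # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_cross_entropy(y, p))         # confident wrong predictions are penalized hardest
```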
- Linear RegressionLinear regression models a target as an affine function of the inputs, typically \( y \approx w^\top x + b \), and fits the parameters by minimizing squared residuals. It is the canonical baseline for regression because it is interpretable and often has a closed-form OLS solution.
- Logistic RegressionLogistic regression is a linear classifier that models the log-odds of a class as \( w^\top x + b \) and maps that score through a sigmoid to get a probability. Despite its name, it is a classification model, not a regression model.
- Full Fine-TuneA full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
- Parameter-Efficient Fine-Tuning (PEFT)PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
- Low-Rank Adaptation (LoRA)LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
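A bare-bones numpy sketch of the idea: a frozen weight matrix plus a trainable low-rank product, with shapes, rank, and initialization scales chosen only for illustration (real implementations typically also scale \( BA \) by a factor \( \alpha/r \)).

```python
import numpy as np

# LoRA-style forward pass: frozen W plus a trainable low-rank update B @ A.
d_out, d_in, r = 64, 128, 8
rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(d_out, d_in))    # pretrained weight, never updated
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so B @ A starts at 0

def lora_forward(x):
    # Effective weight is W_frozen + B @ A; the dense update is never materialized.
    return W_frozen @ x + B @ (A @ x)

x = rng.normal(size=d_in)
print(lora_forward(x).shape)                                 # (64,)
print(r * (d_in + d_out), "trainable vs", d_in * d_out)      # far fewer parameters to train
```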
- LoRA AdapterA LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
- QLoRA (Quantized LoRA)QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
- Model MergingModel merging combines the weights or weight deltas of separately fine-tuned models into one model without full retraining. It can create multitask behavior cheaply, but naive averaging often causes interference unless the merged models share a common base and compatible parameter geometry.
- Model SoupsModel soups average the weights of multiple fine-tuned models that lie in the same low-loss basin, often improving accuracy and robustness without extra inference cost. Unlike an ensemble, a soup is still a single model at serving time.
- SLERP (Spherical Linear Interpolation)SLERP interpolates between two parameter vectors along the great-circle path on a sphere instead of using straight-line interpolation. In model merging it can preserve vector norms and sometimes produce smoother blends than linear interpolation when the two directions differ strongly.
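A small numpy sketch of the interpolation formula applied to two flattened vectors; the fallback to linear interpolation for nearly parallel inputs is a common practical guard, not part of the definition.

```python
import numpy as np

# Spherical linear interpolation between two parameter vectors (toy example).
def slerp(v0, v1, t, eps=1e-8):
    u0, u1 = v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))    # angle between the two directions
    if omega < eps:                                          # nearly parallel: plain lerp is fine
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(slerp(a, b, 0.5))    # lies on the arc between a and b rather than on the chord
```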
- TIES-MergingTIES-Merging is a model-merging method that reduces interference by trimming small parameter changes, resolving sign conflicts, and merging only updates aligned with an agreed sign. It is designed for cases where separately fine-tuned models disagree on which weights should move and in what direction.
- DARE (Drop And REscale)DARE sparsifies fine-tuning deltas by randomly dropping many update entries and rescaling the rest, preserving most behavior while greatly reducing storage and merge interference. It is commonly used as a preprocessing step for delta compression or model merging rather than as a standalone training method.
- Task VectorA task vector is the weight difference between a pre-trained model and the same model after fine-tuning on a task. Adding, subtracting, or scaling that vector can steer behavior, so task vectors provide a simple weight-space tool for editing or combining capabilities.
- Model CompressionModel compression reduces a model’s memory, latency, or energy cost while trying to preserve performance. Common compression methods include distillation, pruning, quantization, low-rank factorization, and architecture redesign.
- Knowledge DistillationKnowledge distillation trains a smaller student model to match the outputs or internal representations of a larger teacher. It transfers some of the teacher’s behavior into a cheaper model, often using soft targets that contain more information than hard labels alone.
- PruningPruning removes weights, neurons, heads, or entire blocks that contribute little to performance. It can reduce model size or compute, but aggressive pruning usually needs fine-tuning to recover lost accuracy.
- Data ParallelismData parallelism replicates the model on multiple devices and splits each batch across them, synchronizing gradients after each step. It is the simplest way to scale training throughput, but every device still stores the full model unless sharding is added.
- Model ParallelismModel parallelism splits one model across multiple devices because it is too large or compute-heavy for a single device. The split can happen by layers, tensors, experts, or sequence chunks, trading memory savings for extra communication.
- Pipeline ParallelismPipeline parallelism partitions a model by layers across devices and sends microbatches through the partitions like an assembly line. It reduces per-device memory, but pipeline bubbles and stage imbalance can waste throughput if the schedule is poorly tuned.
- Fully Sharded Data Parallel (FSDP)Fully Sharded Data Parallel shards model parameters, gradients, and optimizer states across data-parallel workers, gathering full parameters only when needed for computation. It is the PyTorch analogue of ZeRO-style training and makes much larger models fit without custom model-parallel code.
- Gradient ClippingGradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
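A minimal numpy sketch of clipping by global norm; the list of gradient arrays and the threshold are made up, and deep-learning frameworks ship equivalent utilities.

```python
import numpy as np

# Global-norm gradient clipping over a list of (hypothetical) gradient arrays.
def clip_by_global_norm(grads, max_norm):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # shrink only when the norm is too large
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]        # global norm is 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, [np.linalg.norm(g) for g in clipped])       # directions preserved, total norm capped at 1
```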
- Weight DecayWeight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but in adaptive optimizers the decoupled AdamW form is usually preferred.
- Adam OptimizerAdam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter’s update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
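A bare-bones numpy sketch of the update rule with the commonly quoted default hyperparameters, applied to a trivial quadratic so the behavior is easy to inspect.

```python
import numpy as np

# One Adam step: moment updates, bias correction, and the scaled parameter update.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                  # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g              # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)                  # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    g = 2 * w                                  # gradient of f(w) = ||w||^2
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
print(w)                                       # near zero, with small residual oscillation at fixed lr
```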
- AdaGradAdaGrad adapts learning rates by dividing each parameter’s update by the square root of the accumulated historical squared gradients. It works especially well for sparse features, but its learning rates can decay too aggressively over long training runs.
- RMSPropRMSProp uses an exponential moving average of squared gradients to normalize updates, preventing the denominator from growing without bound as in AdaGrad. It is useful for nonstationary problems and was a key precursor to Adam.
- MomentumMomentum accumulates a running velocity of past gradients so updates keep moving in consistent directions and damp noisy zig-zags. It speeds optimization in ravines and is commonly paired with SGD or Nesterov variants.
- Curriculum LearningCurriculum learning trains a model on examples in an organized order, usually from easier or more structured cases to harder ones. The idea is to improve optimization and generalization by shaping the training distribution, though a bad curriculum can also slow learning or bias the model.
- Memory Optimization (ML Training)Memory optimization in ML training is the collection of techniques that reduce peak memory so larger models or batches fit on available hardware. Common examples are mixed precision, activation checkpointing, optimizer sharding, offloading, and more memory-efficient attention kernels.
- KL DivergenceKL divergence measures how one probability distribution differs from a reference distribution through an expected log-ratio. It is nonnegative and asymmetric, so it is best understood not as a distance but as the penalty for modeling samples from one distribution with another.
- Expectation–Maximization (EM) AlgorithmThe EM algorithm is an iterative method for maximum-likelihood or MAP estimation in models with latent variables. Each round first estimates the hidden structure under the current parameters and then re-optimizes the parameters as if that hidden structure were known.
- Jacobian and HessianThe Jacobian collects first-order partial derivatives of a vector-valued function, while the Hessian collects second-order partial derivatives of a scalar function. Together they describe local sensitivity and curvature, which is why they are central to optimization and dynamical analysis.
- GrokkingGrokking is a delayed generalization phenomenon in which a model first memorizes the training set and only much later snaps into a simple algorithm that generalizes well. It is interesting because the model already had enough capacity to fit the data, yet the more general solution emerged only after long training and regularization pressure.
- Auxiliary Load-Balancing Loss (MoE)The auxiliary load-balancing loss in a Mixture-of-Experts model encourages the router to spread tokens more evenly across experts. Without it, routing often collapses onto a few experts, which wastes capacity and creates severe hot spots in both learning and systems performance.
- Muon OptimizerMuon is an optimizer designed especially for matrix-valued parameters that replaces the raw update direction with an orthogonalized one. The point is to respect matrix structure rather than treating every weight tensor as a flattened vector, with the goal of improving training efficiency relative to standard first-order optimizers.
- GPTQ QuantizationA one-shot, layer-wise post-training quantization method that pushes LLM weights to 3–4 bits while preserving generation quality. GPTQ reformulates quantization as a per-row error-minimization problem solved greedily using an inverse Hessian computed from a small calibration set, achieving 3-bit LLaMA-65B with <1% perplexity loss.
- AWQ (Activation-aware Weight Quantization)AWQ is a post-training quantization method that preserves quality by protecting the weights attached to the largest activation channels before rounding. It targets low-bit LLM inference with small accuracy loss and is popular because it is simpler to deploy than Hessian-based methods such as GPTQ.
- Taylor Series ExpansionApproximates a smooth function near a point \( a \) by a polynomial whose coefficients are the function's derivatives: \( f(x) \approx \sum_{k=0}^{K} \frac{f^{(k)}(a)}{k!}(x-a)^k \). Provides the theoretical scaffolding for gradient descent (1st order), Newton's method (2nd order), Laplace approximations, and loss-landscape analysis.
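A quick numeric check of the expansion for \( e^x \) around \( a = 0 \); the orders shown are arbitrary.

```python
import math

# Truncated Taylor series of exp at a = 0, compared against the exact value.
def taylor_exp(x, order):
    return sum(x ** k / math.factorial(k) for k in range(order + 1))

x = 0.5
for order in (1, 2, 4, 8):
    approx = taylor_exp(x, order)
    print(order, approx, abs(approx - math.exp(x)))   # error shrinks rapidly with the order
```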
- Matrix Norms: Frobenius, Spectral, NuclearThe three dominant matrix norms measure different notions of size: Frobenius \( \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2} \) is the entrywise \( \ell_2 \); spectral (operator) \( \|A\|_2 = \sigma_{\max}(A) \) is the largest gain on unit vectors; nuclear \( \|A\|_* = \sum_i \sigma_i(A) \) is the convex surrogate for rank. Each appears in a distinct ML context: regularisation, Lipschitz bounds, and low-rank recovery respectively.
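A short numpy sketch computing all three norms on a random matrix and checking the standard orderings between them.

```python
import numpy as np

# Frobenius, spectral, and nuclear norms of a random matrix (shape is arbitrary).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values

frobenius = np.linalg.norm(A, "fro")         # sqrt of the sum of squared entries
spectral = np.linalg.norm(A, 2)              # largest singular value
nuclear = sigma.sum()                        # sum of singular values

# ||A||_F^2 equals the sum of squared singular values, and spectral <= Frobenius <= nuclear.
print(frobenius, np.sqrt((sigma ** 2).sum()))
print(spectral, frobenius, nuclear)
```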
- Convex Optimization FundamentalsA convex optimization problem minimises a convex function over a convex feasible set. Its defining property: every local minimum is global. Convexity also gives polynomial-time algorithms with strong guarantees — the backdrop against which non-convex deep learning is understood as a deliberate departure.
- KKT Conditions and Lagrangian DualityThe Karush–Kuhn–Tucker (KKT) conditions generalise \( \nabla f = 0 \) to constrained optimisation, giving necessary (and, for convex problems, sufficient) conditions for a minimum. Lagrangian duality transforms the constrained primal into a dual over multipliers; for convex problems with Slater's condition, the duality gap closes — the workhorse derivation behind SVMs, interior-point methods, and much of RL.
- Newton's Method and Quasi-Newton (L-BFGS)Newton's method uses the second-order model \( f(x+\Delta) \approx f(x) + g^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta \) to take the step \( \Delta = -H^{-1} g \). It converges quadratically near a minimum but is impractical in high dimensions. Quasi-Newton methods (BFGS, L-BFGS) build an approximation to \( H^{-1} \) from gradient histories, retaining super-linear convergence without ever forming or inverting the exact Hessian.
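A toy numpy sketch of the pure Newton iteration on a smooth two-dimensional function chosen only for illustration; real quasi-Newton code builds the inverse-Hessian approximation instead of calling a dense solver.

```python
import numpy as np

# Newton's method on f(x) = (x0 - 1)^2 + 10*(x1 + 2)^4, a convex toy objective.
def f_grad_hess(x):
    g = np.array([2 * (x[0] - 1), 40 * (x[1] + 2) ** 3])
    H = np.array([[2.0, 0.0],
                  [0.0, 120 * (x[1] + 2) ** 2]])
    return g, H

x = np.array([5.0, 5.0])
for _ in range(20):
    g, H = f_grad_hess(x)
    x = x - np.linalg.solve(H, g)    # Newton step: Delta = -H^{-1} g
print(x)                             # approaches the minimizer at (1, -2)
```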
- Proximal Gradient Methods (ISTA / FISTA)Proximal gradient methods solve objectives of the form \(f(x)+g(x)\) where \(f\) is smooth and \(g\) is nonsmooth but has an easy proximal operator. Each step does gradient descent on \(f\) and then a shrinkage- or projection-like update for \(g\), which is why the method underlies Lasso, sparse coding, and constrained convex optimization.
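A compact numpy sketch of ISTA for the Lasso on synthetic data; the regularization strength, noise level, and iteration count are arbitrary, and the key ingredients are the gradient step on the smooth part followed by soft-thresholding (the \( \ell_1 \) prox).

```python
import numpy as np

# ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1 on a synthetic sparse problem.
def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)   # proximal operator of the l1 norm

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50)
x_true[:5] = np.array([5.0, -4.0, 3.0, 2.0, -2.0])            # sparse ground truth
b = A @ x_true + 0.1 * rng.normal(size=100)

lam = 5.0
step = 1.0 / np.linalg.norm(A, 2) ** 2                        # 1 / Lipschitz constant of the smooth part
x = np.zeros(50)
for _ in range(500):
    x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)  # gradient step, then prox

print(np.count_nonzero(np.abs(x) > 1e-8))                     # far fewer than 50 nonzeros: a sparse solution
```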
- Conjugate Gradient MethodConjugate gradient (CG) solves \( Ax = b \) for symmetric PD \( A \) using only matrix-vector products, converging in at most \( n \) steps in exact arithmetic and much faster when the eigenvalue spectrum is clustered. CG underlies truncated-Newton optimisation, the natural-gradient in K-FAC, and large-scale Gaussian-process inference.
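A plain numpy sketch of the textbook CG recursion on a small synthetic SPD system; the sizes and the diagonal shift used to guarantee positive-definiteness are arbitrary.

```python
import numpy as np

# Conjugate gradient for Ax = b with symmetric positive-definite A, using only A @ v products.
def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs_old = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 30))
A = M @ M.T + 30 * np.eye(30)      # symmetric positive definite by construction
b = rng.normal(size=30)
print(np.linalg.norm(A @ conjugate_gradient(A, b) - b))   # residual should be tiny
```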
- Loss Landscape: Flat vs Sharp MinimaFlat minima (low curvature / small Hessian eigenvalues) generalise better than sharp minima (high curvature), empirically and via PAC-Bayes bounds. SGD's noise, large batch sizes, and the Sharpness-Aware Minimisation (SAM) optimiser all interact with this: small-batch SGD prefers flat minima, large-batch SGD falls into sharper ones, and SAM explicitly penalises sharpness during training.
- Implicit Regularisation of SGDOver-parameterised networks trained by SGD generalise despite being able to fit pure noise — SGD's trajectory biases the solution toward specific minima. For linear models, gradient flow converges to the minimum-norm interpolator; for deep nets, SGD with small LR and moderate batch behaves like Bayesian inference with an implicit prior on flat minima. This implicit bias helps explain why modern deep learning can generalise without explicit capacity control.
- Natural Gradient & Fisher–Rao GeometryThe natural gradient \( \tilde\nabla_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L} \) preconditions the Euclidean gradient by the inverse Fisher information matrix, yielding steepest descent under the KL-divergence metric on the statistical manifold. It underlies K-FAC, Shampoo, TRPO's trust region, and the original motivation for reparameterisation-invariant optimisation.
- Sharpness-Aware Minimization (SAM)Minimise a loss whose value is worst-case over a small \( \rho \)-ball of weight perturbations: \( \min_\theta \max_{\|\varepsilon\| \le \rho} \mathcal{L}(\theta + \varepsilon) \). The ascent step \( \varepsilon^\star \approx \rho \, \nabla \mathcal{L}/\|\nabla \mathcal{L}\| \) biases training toward flat minima, improving generalisation across ViT, ResNet, and LLM finetuning.
- Lion OptimizerA momentum-only optimizer discovered by automated program search: updates are \( \theta \leftarrow \theta - \eta \, \text{sign}(\beta_1 m + (1-\beta_1) g) \). Uses only the sign of a momentum estimate — no second moment, half the state of AdamW, often matches or beats it on large-scale training with smaller learning rates.
- Shampoo & K-FAC PreconditionersShampoo and K-FAC are second-order-inspired optimizers that precondition gradients with matrix or blockwise curvature information instead of only per-parameter learning rates. They aim to converge in fewer steps than Adam or SGD, especially in large-batch training where curvature estimates are more stable.
- AdaFactor OptimizerA memory-efficient Adam variant. Replaces the full second-moment matrix with row and column running averages (a rank-1 factorisation): memory is \( O(m + n) \) instead of \( O(mn) \). Used by T5, Meta's BlenderBot, and many large-scale models to free memory for bigger batches and longer contexts.
- Meta-Learning: MAML & ReptileMAML learns an initialisation \( \theta \) such that one or a few SGD steps on a new task yield good performance — formally \( \min_\theta \sum_\tau \mathcal{L}_\tau(\theta - \alpha \nabla \mathcal{L}_\tau(\theta)) \). Reptile is a first-order simplification that moves \( \theta \) toward per-task adapted parameters. Influential early 'learn to learn' recipes, since absorbed into prompt-based few-shot learning.
- Gradient Accumulation & Micro-BatchingSplit a large effective batch into \( k \) micro-batches; accumulate their gradients in a buffer, then step once. Decouples statistical batch size from hardware batch size, enabling \( N\cdot k \) effective batch without a \( k\times \) memory blow-up. Essential for training large models on any GPU.
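A numpy sketch of the accounting, standing in for a training framework: gradients from \( k \) micro-batches are averaged into a buffer and a single update is taken, which for this loss matches one step on the full batch exactly.

```python
import numpy as np

# Gradient accumulation: k micro-batches, one optimizer step (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10)

def mse_grad(w, Xb, yb):
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

lr, micro_batch, k = 0.1, 32, 8                     # effective batch = 32 * 8 = 256
w = np.zeros(10)

grad_buffer = np.zeros_like(w)
for i in range(k):
    sl = slice(i * micro_batch, (i + 1) * micro_batch)
    grad_buffer += mse_grad(w, X[sl], y[sl]) / k    # accumulate, do not step yet
w_accumulated = w - lr * grad_buffer                # one update after k micro-batches

w_fullbatch = w - lr * mse_grad(w, X, y)            # the same update computed on the full batch
print(np.max(np.abs(w_accumulated - w_fullbatch)))  # ~0: the two updates are equivalent
```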
- Differential Privacy & DP-SGDFormal guarantee that an algorithm's output distribution barely changes if any single training example is replaced. DP-SGD achieves \( (\varepsilon, \delta) \)-DP by clipping per-example gradients and adding calibrated Gaussian noise. Central to privacy-preserving ML training on sensitive data.
- Mean Field Theory of Neural NetworksMean-field theory studies very wide neural networks by tracking distributions of parameters or activations instead of individual weights. It yields clean scaling limits for training dynamics and feature learning, and helps distinguish true feature-learning regimes from the lazy-training NTK regime.
- Stability and GeneralizationAn algorithm is uniformly \( \beta \)-stable if replacing one training point changes its output's loss by at most \( \beta \). Bousquet & Elisseeff (2002) proved that \( \beta \)-stability bounds the generalization gap by \( O(\beta + 1/\sqrt{n}) \); Hardt, Recht & Singer (2016) showed SGD on smooth losses is \( O(T/n) \)-stable, giving the first algorithm-dependent generalization bound for deep learning that grows with training time.
- Mode Connectivity in Loss LandscapesMode connectivity is the empirical finding that independently trained solutions can often be connected by a low-loss path in parameter space. This suggests that many minima in deep learning are not isolated basins but parts of wider connected regions.
- Gradient Noise ScaleA scalar diagnostic that estimates the largest useful batch size by comparing the variance of per-example gradients to the squared norm of the mean gradient: \( B_{\text{noise}} = \operatorname{tr}(\Sigma)/\|g\|^2 \). McCandlish et al. (2018) argue that returns to scaling batch size diminish sharply once \( B \gg B_{\text{noise}} \), giving a principled way to choose batch size during large-scale training.
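A rough numpy sketch of estimating the quantity from per-example gradients of a toy least-squares model; everything here is synthetic and the estimate only shows the mechanics, not a production measurement.

```python
import numpy as np

# Simple gradient noise scale B_noise = tr(Sigma) / ||g||^2 from per-example gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = X @ rng.normal(size=20) + rng.normal(size=2000)    # noisy targets
w = np.zeros(20)

per_example_grads = 2 * (X @ w - y)[:, None] * X       # one gradient row per example
g_mean = per_example_grads.mean(axis=0)                # full-batch gradient estimate
trace_sigma = per_example_grads.var(axis=0).sum()      # tr(Sigma): total per-example gradient variance
b_noise = trace_sigma / (g_mean @ g_mean)
print(b_noise)    # batch sizes far above this give diminishing returns
```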
- Adaptive Gradient Clipping (AGC)A per-parameter clipping rule introduced by Brock et al. (2021) that bounds each weight's update by a fraction of the weight's own norm: clip \( g \) so \( \|g\|/\|w\| \le \lambda \). Unlike global-norm clipping, AGC scales naturally with parameter magnitude and made it possible to train Normalizer-Free Networks (NFNets) without batch normalisation while matching its training stability.
- Self-Paced LearningA curriculum-learning variant where the model itself decides which examples are 'easy enough' to train on at the current step, by minimising a joint objective \( \sum_i v_i \ell_i - \lambda \sum_i v_i \) over both parameters \( \theta \) and per-example weights \( v_i \in \{0,1\} \) (or \( [0,1] \)). Kumar, Packer & Koller (2010) introduced it as a non-convex EM-style alternative to handcrafted curricula; \( \lambda \) is annealed from low (only easy examples) to high (all examples).
- Gradient Surgery (PCGrad) for Multi-Task LearningWhen two task gradients in multi-task learning point in conflicting directions (negative cosine), they partially cancel each other and slow learning. Yu et al.'s PCGrad (2020) projects each task gradient onto the normal plane of any conflicting task's gradient before summing, removing the destructive component. This 'gradient surgery' restores monotone progress on both tasks at modest cost.
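A small numpy sketch of the projection for the two-task case with made-up gradient vectors; the full method iterates over tasks in random order, but the conflict-removal step is the same.

```python
import numpy as np

# PCGrad-style projection for two tasks: remove the component of each gradient
# that points against the other, then sum the projected gradients.
def pcgrad_two_tasks(g1, g2):
    g1_pc, g2_pc = g1.copy(), g2.copy()
    if g1 @ g2 < 0:                                  # conflict: negative cosine similarity
        g1_pc = g1 - (g1 @ g2) / (g2 @ g2) * g2      # project g1 onto the normal plane of g2
        g2_pc = g2 - (g2 @ g1) / (g1 @ g1) * g1      # and vice versa, using the original g1
    return g1_pc + g2_pc                             # combined update with the destructive part removed

g_task_a = np.array([1.0, 2.0])
g_task_b = np.array([1.0, -1.5])                     # conflicts with g_task_a
print(pcgrad_two_tasks(g_task_a, g_task_b))
```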
- Neural Architecture Search (modern approaches)NAS automates the design of network architectures by searching a parameterised space against a validation objective. Modern NAS abandons the slow RL-controller approach (NASNet) in favour of weight-sharing one-shot supernets (DARTS, ProxylessNAS), zero-cost proxies, and architecture-aware scaling laws (EfficientNet, NFNet). NAS has produced strong vision backbones but plays a smaller role in the LLM era.
- Backpropagation — History (Werbos → Rumelhart/Hinton/Williams)The history of backpropagation is the story of an idea known in pieces before it became a practical neural-network training method. Werbos articulated reverse-mode differentiation for network training in the 1970s, and Rumelhart, Hinton, and Williams turned it into the landmark 1986 demonstration that made multilayer neural networks trainable in practice.