Machine LearningWikiPaths

Foundations of Machine Learning

Build a solid mathematical and conceptual grounding. Covers linear models, error decomposition, and essential optimization.

Estimated time: ~60 min

Study this path with flashcards
206 cards
Study →
  1. Step 1
    AdaGrad adapts learning rates by dividing each parameter’s update by the square root of the accumulated historical squared gradients. It works especially well for sparse features, but its learning rates can decay too aggressively over long training runs.
  2. Step 2
    Adam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter’s update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
  3. Step 3
    Automatic differentiation computes exact derivatives of a program by systematically composing derivatives of its primitive operations. Unlike symbolic differentiation it does not manipulate formulas, and unlike numerical differentiation it does not rely on finite-difference approximations.
  4. Step 4
    Backpropagation computes gradients of a scalar loss with respect to all network parameters by applying the chain rule backward through the computation graph. It makes deep learning practical because it turns a complicated nested function into reusable local gradient calculations.
  5. Step 5
    The backward pass propagates gradients from the loss back through the computation graph to determine how each parameter affected the final error. It uses stored forward-pass intermediates and the chain rule to accumulate derivatives efficiently.
  6. Step 6
    Batch normalization normalizes activations using mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference because it relies on running statistics.
  7. Step 7
    Bayes’ theorem updates beliefs by combining a prior with the likelihood of observed evidence to produce a posterior. In compact form, posterior is proportional to likelihood times prior, which is why Bayesian inference is fundamentally a rule for disciplined belief revision.
  8. Step 8
    The chain rule gives the derivative of a composition of functions by multiplying local derivatives along the computation path. It is the mathematical principle that backpropagation applies at scale throughout a neural network.
  9. Step 9
    A composite function applies one function to the output of another, such as f of g of x. Neural networks are composite functions at scale, which is why gradients are computed by repeatedly applying the chain rule.
  10. Step 10
    A computational graph represents a calculation as nodes for variables or operations and edges for data dependencies. It is useful because the same graph that defines the forward computation can also be traversed backward to perform automatic differentiation.
  11. Step 11
    Conditional independence means two variables become unrelated once a third variable is known. It is the simplifying assumption that makes graphical models tractable and explains why conditioning can either remove dependence or, in collider structures, create it.
  12. Step 12
    A confusion matrix counts predicted labels against true labels. In binary classification it yields the four basic counts—true positives, false positives, true negatives, and false negatives—from which most common thresholded metrics are derived.
  13. Step 13
    For a matrix A, an eigenvector is a nonzero direction that A only rescales, and the scaling factor is its eigenvalue. Eigenpairs matter because they reveal invariant directions, control stability, and make problems like PCA and spectral clustering possible.
  14. Step 14
    Empirical risk minimization chooses the model with the smallest average training loss. It is the default principle behind most supervised learning, but it must be paired with capacity control or held-out evaluation because low training loss alone does not guarantee generalization.
  15. Step 15
    Feature scaling rescales input dimensions to comparable magnitudes, while standardization specifically subtracts the training mean and divides by the training standard deviation. It matters because optimization, distances, and margins can otherwise be dominated by whichever feature uses the largest units.
  16. Step 16
    A full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
  17. Step 17
    Gradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
  18. Step 18
    Gradient descent minimizes a differentiable objective by repeatedly moving parameters in the direction of steepest local decrease, namely the negative gradient. Its step size is set by the learning rate, so convergence depends on both objective geometry and update scale.
  19. Step 19
    Builds a tree (dendrogram) of clusters either bottom-up (agglomerative: start with singletons, merge closest pairs) or top-down (divisive). The linkage criterion — single, complete, average, Ward — defines distance between clusters and dictates the cluster shapes the algorithm prefers.
  20. Step 20
    The Jacobian collects first-order partial derivatives of a vector-valued function, while the Hessian collects second-order partial derivatives of a scalar function. Together they describe local sensitivity and curvature, which is why they are central to optimization and dynamical analysis.
  21. Step 21
    k-fold cross-validation rotates a held-out fold through the dataset so every example is used for validation once and training the other times. It uses limited data efficiently for model selection, but it costs multiple training runs and must keep preprocessing inside each fold.
  22. Step 22
    k-nearest neighbors predicts by finding the k closest labeled training points to a query and then voting or averaging their labels. It has almost no training phase, but its predictions depend heavily on the distance metric and it degrades in high-dimensional spaces.
  23. Step 23
    L1 regularization adds a penalty proportional to the sum of absolute parameter values, encouraging many coefficients to become exactly zero. That sparsity makes Lasso useful when feature selection is part of the goal, not just shrinkage.
  24. Step 24
    The L1 norm sums absolute values and tends to promote sparsity when used as a penalty, while the L2 norm measures Euclidean length and tends to shrink weights smoothly without zeroing many of them out. That difference is why L1 is associated with feature selection and L2 with stable shrinkage.
  25. Step 25
    L2 regularization adds a penalty proportional to the sum of squared parameter values, shrinking weights toward zero without usually making them exactly sparse. In plain SGD it is equivalent to weight decay and is widely used because it improves stability and reduces variance.
  26. Step 26
    The law of total probability computes an event probability by summing over mutually exclusive, exhaustive cases. In machine learning it is the basic marginalization identity behind latent-variable models, mixture models, and many Bayesian calculations.
  27. Step 27
    Linear regression models a target as an affine function of the inputs, typically \( y \approx w^\top x + b \), and fits the parameters by minimizing squared residuals. It is the canonical baseline for regression because it is interpretable and often has a closed-form OLS solution.
  28. Step 28
    Linear regression fits an affine function to predict a continuous target by minimizing squared residuals between predictions and observed values. In matrix form, ordinary least squares projects the target vector onto the column space of the design matrix, yielding the normal equations in the full-rank case.
  29. Step 29
    Logistic regression is a linear classifier that models the log-odds of a class as \( w^\top x + b \) and maps that score through a sigmoid to get a probability. Despite its name, it is a classification model, not a regression model.
  30. Step 30
    A loss function maps a model's prediction and the true target to a scalar error signal that training aims to minimize. It defines what the model is optimized for, so changing the loss changes which mistakes are treated as costly.
  31. Step 31
    Mean squared error averages the squared difference between predicted and true values, making large errors count disproportionately more than small ones. For regression it is especially important because minimizing MSE is equivalent to maximum likelihood under Gaussian noise.
  32. Step 32
    Mini-batch gradient descent estimates the gradient on a small subset of training examples at each update instead of on the full dataset or a single example. It is the practical default in deep learning because it balances hardware efficiency with optimization noise.
  33. Step 33
    Momentum accumulates a running velocity of past gradients so updates keep moving in consistent directions and damp noisy zig-zags. It speeds optimization in ravines and is commonly paired with SGD or Nesterov variants.
  34. Step 34
    Naive Bayes is a probabilistic classifier that applies Bayes’ theorem under the simplifying assumption that features are conditionally independent given the class. That assumption is usually false, but the model is still fast, data-efficient, and surprisingly effective for sparse text problems.
  35. Step 35
    The OLS closed-form solution is the exact least-squares answer computed directly from the design matrix rather than by iterative optimization. In the full-rank case it solves the normal equations, and geometrically it projects the target vector onto the column space of the features.
  36. Step 36
    PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
  37. Step 37
    Prediction error is the difference between a model's prediction and the true target for an example. It is the atomic quantity from which losses, residual analysis, and generalization metrics are built.
  38. Step 38
    Regularization is any technique that biases learning toward simpler, more stable, or less overfit solutions. It can appear as an explicit penalty such as weight decay or as an implicit training choice such as data augmentation, dropout, or early stopping.
  39. Step 39
    RMSProp uses an exponential moving average of squared gradients to normalize updates, preventing the denominator from growing without bound as in AdaGrad. It is useful for nonstationary problems and was a key precursor to Adam.
  40. Step 40
    Stochastic gradient descent updates parameters using a gradient estimate from one example or a very small random batch, making each step noisy but cheap. That noise can slow exact convergence yet often helps large models optimize and generalize in practice.
  41. Step 41
    A train/validation/test split separates fitting, model selection, and final evaluation into different datasets. The test set is kept untouched until the end so it remains a credible estimate of out-of-sample performance.
  42. Step 42
    Weight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but in adaptive optimizers the decoupled AdamW form is usually preferred.
  43. Step 43
    Xavier or Glorot initialization chooses weight variance from fan-in and fan-out so activations and gradients stay roughly stable across deep layers. It is well suited to symmetric activations such as tanh, while ReLU networks usually prefer He initialization.
  44. Step 44
    Anomaly detection identifies observations that look unlikely under the pattern of normal data. The main families are density-based methods, reconstruction-based methods, and one-class classification methods, and the right choice depends on whether you have labels, strong feature engineering, or only normal examples.
  45. Step 45
    The history of backpropagation is the story of an idea known in pieces before it became a practical neural-network training method. Werbos articulated reverse-mode differentiation for network training in the 1970s, and Rumelhart, Hinton, and Williams turned it into the landmark 1986 demonstration that made multilayer neural networks trainable in practice.
  46. Step 46
    The Brier score measures the mean squared error of probabilistic predictions, so it rewards both correctness and calibration. Lower is better, and unlike accuracy it penalizes a confidently wrong 0.99 prediction much more than a cautious 0.6 prediction.
  47. Step 47
    In optimization, convergence means an algorithm's iterates approach a stable solution or stationary point as updates continue. In practice people often mean that the loss or parameters stop changing much, though true convergence depends on the objective and algorithmic assumptions.
  48. Step 48
    Dropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer.
  49. Step 49
    Early stopping regularizes training by halting optimization when validation performance stops improving and keeping the best checkpoint seen so far. It works because prolonged optimization can eventually fit noise or idiosyncrasies of the training set rather than signal.
  50. Step 50
    Generalization is a model's ability to perform well on unseen data from the same underlying distribution as its training data. It is the real goal of learning, because low training error alone can come from memorization rather than useful structure.
  51. Step 51
    A hyperparameter is a setting chosen outside the optimization loop, such as learning rate, model width, regularization strength, or batch size. Unlike learned parameters, hyperparameters govern how the model is trained or structured and are usually selected by validation.
  52. Step 52
    The learning rate is the scalar that sets how large each optimization step is when parameters are updated. If it is too high training can diverge or oscillate, and if it is too low training can become extremely slow or get stuck.
  53. Step 53
    Model compression reduces a model’s memory, latency, or energy cost while trying to preserve performance. Common compression methods include distillation, pruning, quantization, low-rank factorization, and architecture redesign.
  54. Step 54
    Overfitting happens when a model fits patterns specific to the training set, including noise, better than it captures the underlying data-generating structure. The usual symptom is low training error paired with substantially worse validation or test error.
  55. Step 55
    Data leakage is any contamination that lets training or validation use information that would not be available at prediction time. Target leakage is the specific case where features encode the label or a post-outcome proxy for it, so every target leakage problem is data leakage, but not every data leakage problem is target leakage.
  56. Step 56
    The four atomic discrete distributions: Bernoulli \( (p) \) for a single binary trial, Binomial \( (n, p) \) for a sum of \( n \) such trials, Categorical \( (\pi) \) for a single \( K \)-class outcome (softmax target), and Multinomial \( (n, \pi) \) for counts across \( n \) such trials. They form the likelihood backbone of logistic regression, cross-entropy training, and count models.
  57. Step 57
    Given i.i.d. samples \( X_1, \dots, X_n \) with finite mean \( \mu \) and variance \( \sigma^2 \), the standardised sample mean \( \sqrt{n}(\bar X_n - \mu)/\sigma \) converges in distribution to \( \mathcal{N}(0,1) \). The CLT underlies confidence intervals, stochastic-gradient noise analysis, and many initialisation arguments.
  58. Step 58
    The factorisation \( p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid x_{<i}) \) that decomposes any joint distribution into a product of conditionals. It is the mathematical bedrock of autoregressive language models, belief networks, and most tractable density estimation.
  59. Step 59
    A single identity \( H(p,q) = H(p) + D_{\text{KL}}(p \| q) \) ties the three together: entropy is the optimal code length under \( p \); cross-entropy is the code length using \( q \); KL is the extra cost of the mismatch. This view clarifies why minimising classification cross-entropy is equivalent to MLE and to minimising KL.
  60. Step 60
    The three dominant matrix norms measure different notions of size: Frobenius \( \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2} \) is the entrywise \( \ell_2 \); spectral (operator) \( \|A\|_2 = \sigma_{\max}(A) \) is the largest gain on unit vectors; nuclear \( \|A\|_* = \sum_i \sigma_i(A) \) is the convex surrogate for rank. Each appears in a distinct ML context: regularisation, Lipschitz bounds, and low-rank recovery respectively.
  61. Step 61
    The rank of \( A \in \mathbb{R}^{m \times n} \) is the dimension of its column space, equal to the dimension of its row space. The rank–nullity theorem states \( \text{rank}(A) + \text{nullity}(A) = n \), linking how much a linear map preserves to how much it collapses — the foundation for invertibility conditions, low-rank adaptation, and OLS identifiability.
  62. Step 62
    A symmetric matrix \( A \in \mathbb{R}^{n \times n} \) is positive semi-definite (PSD) if \( x^\top A x \ge 0 \) for all \( x \), and positive definite (PD) if strict for \( x \ne 0 \). Equivalent characterisations include non-negative eigenvalues and a Cholesky factorisation \( A = L L^\top \). PSD structure underlies covariance matrices, Gram matrices, convex Hessians, and every kernel method.
  63. Step 63
    A random variable \( X \) is a measurable map from a probability space \( (\Omega, \mathcal{F}, P) \) to \( \mathbb{R} \). Its expectation \( \mathbb{E}[X] = \int X\,dP \) and variance \( \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \) are the first two moments. Beyond textbook formulas, this axiomatic view explains why expectations linearise sums, exist iff \( \mathbb{E}|X| < \infty \), and commute with limits under uniform integrability.
  64. Step 64
    Approximates a smooth function near a point \( a \) by a polynomial whose coefficients are the function's derivatives: \( f(x) \approx \sum_{k=0}^{K} \frac{f^{(k)}(a)}{k!}(x-a)^k \). Provides the theoretical scaffolding for gradient descent (1st order), Newton's method (2nd order), Laplace approximations, and loss-landscape analysis.
  65. Step 65
    Variance measures the spread of a single random variable; covariance measures joint variation of two; correlation normalises covariance to \( [-1, 1] \) and is scale-free. Together they form the second-order statistics that drive PCA, linear regression, Kalman filters, and most initialisation schemes.
  66. Step 66
    An ablation study removes or alters one component of a system to measure how much that component actually contributes. Experimental control matters because an ablation is only informative when the comparison keeps everything else fixed, including data, tuning budget, and evaluation protocol.
  67. Step 67
    Backpropagation through time trains a recurrent network by unrolling it across sequence steps and applying backpropagation to the resulting deep computational graph. It exposes how earlier states influence later losses, but long unrolls make optimization and memory use difficult.
  68. Step 68
    Bayes risk is the minimum achievable expected loss under the true data distribution, and the Bayes-optimal classifier attains it by minimizing posterior expected loss for each input. Under ordinary 0–1 loss, that rule becomes “predict the class with highest posterior probability.”
  69. Step 69
    MC dropout estimates predictive uncertainty by keeping dropout active at test time and averaging many stochastic forward passes. Deep ensembles train several independently initialized models and usually give stronger uncertainty estimates, at higher training and serving cost.
  70. Step 70
    A Bayesian network is a DAG \( G \) over variables whose joint factorises as \( p(x) = \prod_i p(x_i \mid \text{pa}_G(x_i)) \). D-separation reads conditional independences off the graph; parameter learning uses MLE under complete data, EM under latent variables. The mathematical foundation of structured probabilistic modelling.
  71. Step 71
    The Beta \( (\alpha, \beta) \) is the conjugate prior for Bernoulli/Binomial likelihoods; the Dirichlet \( (\boldsymbol{\alpha}) \) is its \( K \)-class generalisation, conjugate to Categorical/Multinomial. Both live on probability simplices and make Bayesian updating a matter of adding counts to hyperparameters — the cleanest introduction to conjugate Bayesian inference and the scaffolding for LDA.
  72. Step 72
    The bias-variance tradeoff describes how expected squared test error decomposes into bias, variance, and irreducible noise. Increasing model flexibility usually lowers bias but raises variance, so the best generalization often lies between underfitting and overfitting.
  73. Step 73
    Binary cross-entropy is the cross-entropy loss for a Bernoulli target, typically \( -[y\log \hat p + (1-y)\log(1-\hat p)] \). It is the standard loss for binary classification and for multi-label problems where each label is predicted independently.
  74. Step 74
    Bootstrap confidence intervals estimate uncertainty by resampling the observed dataset with replacement and recomputing the statistic many times. They are useful when analytic standard errors are awkward, but they inherit the sample's biases and can fail when the original sample is too small or unrepresentative.
  75. Step 75
    Calibration measures whether predicted probabilities match observed frequencies, so events predicted at 70% should occur about 70% of the time. A model can be accurate but poorly calibrated if its confidence is systematically too high or too low.
  76. Step 76
    Hinton's alternative to CNN pooling: neurons are grouped into 'capsules' whose vector output encodes both existence and pose of an entity. Dynamic routing by agreement replaces max-pool, so each capsule decides which higher-level capsule to vote for based on agreement. Historically significant; practically superseded by transformers.
  77. Step 77
    Every symmetric positive-definite matrix \( A \) factors uniquely as \( A = LL^\top \) with \( L \) lower triangular with positive diagonal. Cholesky is twice as fast as LU, numerically stable without pivoting, and the standard building block for Gaussian-process inference, Kalman filters, and sampling from multivariate normals.
  78. Step 78
    Class imbalance means some labels are much rarer than others, so an unweighted objective can be dominated by the majority class. Reweighting changes the loss or sampling scheme so rare classes exert more influence during training.
  79. Step 79
    A discriminative model of \( p(y \mid x) = \tfrac{1}{Z(x)} \exp \sum_k \lambda_k f_k(y, x) \) over structured outputs. Linear-chain CRFs add sequence-level constraints on top of per-token scores, enabling tractable training via forward–backward and Viterbi decoding — still the backbone of NER/tagging heads above neural encoders.
  80. Step 80
    Confounders create misleading associations because they affect both treatment and outcome, while colliders create bias when you condition on them. Simpson’s paradox is the visible symptom that aggregate and stratified associations can reverse direction when the underlying causal structure is ignored.
  81. Step 81
    Conjugate gradient (CG) solves \( Ax = b \) for symmetric PD \( A \) using only matrix-vector products, converging in at most \( n \) steps in exact arithmetic and much faster when the eigenvalue spectrum is clustered. CG underlies truncated-Newton optimisation, the natural-gradient in K-FAC, and large-scale Gaussian-process inference.
  82. Step 82
    A prior is conjugate to a likelihood family if the posterior belongs to the same family as the prior. Conjugacy turns Bayesian updating into a closed-form parameter update and underlies analytical treatments of Beta–Binomial, Dirichlet–Multinomial, and Normal–Normal models.
  83. Step 83
    A convex function is one whose value on any line segment lies below the chord connecting that segment's endpoints. This matters in optimization because convex problems have no spurious local minima: every local minimum is global.
  84. Step 84
    A convex optimization problem minimises a convex function over a convex feasible set. Its defining property: every local minimum is global. Convexity also gives polynomial-time algorithms with strong guarantees — the backdrop against which non-convex deep learning is understood as a deliberate departure.
  85. Step 85
    Cost-sensitive learning assigns different penalties to different kinds of mistakes instead of treating every error equally. It is the right framework when the real objective is to minimize downstream harm or utility loss rather than raw misclassification rate.
  86. Step 86
    A counterfactual asks what would have happened under a different action or treatment than the one that actually occurred. The central difficulty is that for any individual unit, only one potential outcome is observed, so causal inference always requires assumptions to recover the missing alternative.
  87. Step 87
    Cross-entropy measures the average coding cost of samples from a true distribution \( p \) when encoded using a model distribution \( q \). In ML it is the standard loss for classification and language modeling, and minimizing it is equivalent to maximum likelihood up to an entropy constant.
  88. Step 88
    Curriculum learning trains a model on examples in an organized order, usually from easier or more structured cases to harder ones. The idea is to improve optimization and generalization by shaping the training distribution, though a bad curriculum can also slow learning or bias the model.
  89. Step 89
    DARE sparsifies fine-tuning deltas by randomly dropping many update entries and rescaling the rest, preserving most behavior while greatly reducing storage and merge interference. It is commonly used as a preprocessing step for delta compression or model merging rather than as a standalone training method.
  90. Step 90
    Data parallelism replicates the model on multiple devices and splits each batch across them, synchronizing gradients after each step. It is the simplest way to scale training throughput, but every device still stores the full model unless sharding is added.
  91. Step 91
    Data-centric AI treats data quality, labeling, coverage, and feedback loops as first-class levers of model performance rather than focusing only on architecture or hyperparameters. The core workflow is to diagnose failure slices, improve the dataset that generates those failures, and measure whether the model gets better for the right reasons.
  92. Step 92
    DBSCAN clusters points by density: a point is a core point if it has at least \( \text{minPts} \) neighbours within radius \( \varepsilon \); clusters are connected components of core-point neighbourhoods. Non-core points without core neighbours are labelled noise. Unlike \( k \)-means, DBSCAN does not need the number of clusters in advance and handles arbitrary cluster shapes, at the cost of \( \varepsilon \) tuning.
  93. Step 93
    Distribution shift occurs when the joint distribution seen at deployment differs from the one used for training or validation. The main cases are covariate shift, label shift, and concept shift; each breaks generalization in a different way and therefore requires different detection and mitigation strategies.
  94. Step 94
    The EM algorithm is an iterative method for maximum-likelihood or MAP estimation in models with latent variables. Each round first estimates the hidden structure under the current parameters and then re-optimizes the parameters as if that hidden structure were known.
  95. Step 95
    Measures miscalibration: \( \text{ECE} = \sum_b (|B_b|/n) \, |\text{acc}(B_b) - \text{conf}(B_b)| \). Predictions are bucketed by confidence; the gap between mean confidence and accuracy in each bucket sums to ECE. Deep nets are typically over-confident; temperature scaling restores calibration.
  96. Step 96
    Factor analysis models observed variables as linear combinations of a small number of latent factors plus variable-specific noise. It is useful when the goal is to explain covariance structure rather than merely reduce dimension, which is the key difference from PCA.
  97. Step 97
    Train a shared model across many clients (phones, hospitals) without centralising data. Each round: clients train locally for a few epochs, server averages their weight updates. Introduces non-IID-data, communication-cost, and privacy / security challenges absent in centralised training.
  98. Step 98
    Focal loss \( \text{FL}(p_t) = -(1-p_t)^\gamma \log p_t \) down-weights the loss contribution of confidently-classified easy examples, focusing gradient on hard ones. Designed for extreme foreground-background imbalance in one-stage detection; widely used in segmentation and long-tailed classification.
  99. Step 99
    A density model \( p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k) \) that represents data as a weighted sum of Gaussian components. Fitted by the EM algorithm, GMMs are the canonical example of latent-variable density estimation and the statistical cousin of \( k \)-means.
  100. Step 100
    A special case of Metropolis–Hastings: cycle through variables, sampling each from its full conditional \( p(\theta_j \mid \theta_{-j}, D) \). When the conditionals are tractable (conjugate priors, mixture models, LDA), Gibbs is simple, always accepts, and has been the workhorse of Bayesian inference for four decades.
  101. Step 101
    Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
  102. Step 102
    GAT (Veličković et al., 2018) replaces GCN's fixed degree-normalised aggregation with attention : each node learns per-edge weights via a shared attention mechanism. This gives inductive generalisation (no dependence on the full graph's degree matrix), handles heterogeneous neighbourhoods, and approaches Transformer-style flexibility — though at higher computational cost than GCN.
  103. Step 103
    Kipf & Welling's GCN (2017) applies a first-order spectral convolution on graphs: each node's representation is updated as a normalised sum of its neighbours' features, \( H^{(\ell+1)} = \sigma(\tilde D^{-1/2} \tilde A \tilde D^{-1/2} H^{(\ell)} W^{(\ell)}) \). The symmetric normalisation comes from a spectral argument; practically, it is the simplest and most widely-taught GNN layer, setting the template for all message-passing architectures.
  104. Step 104
    Predecessor to ResNet: a gated skip connection \( y = H(x) \cdot T(x) + x \cdot (1 - T(x)) \), where \( T(x) \in [0,1] \) is a learned transform gate. Enabled training of 100+ layer networks before residual connections simplified the construction in ResNet.
  105. Step 105
    Hinge loss penalizes examples that are misclassified or that lie too close to the decision boundary. It is the convex margin-based loss underlying soft-margin SVMs and emphasizes confident separation rather than calibrated probabilities.
  106. Step 106
    A hypothesis test compares a null \( H_0 \) to an alternative \( H_1 \) by computing a test statistic and its tail probability under \( H_0 \) — the p-value. Statistical power is \( 1 - \beta \), the probability of rejecting \( H_0 \) when \( H_1 \) is true. For ML evaluation, these are the tools that separate "this model is better" from "this split was lucky".
  107. Step 107
    Importance sampling estimates an expectation under a target distribution by drawing samples from a different proposal distribution and reweighting them. It is powerful when the proposal places more mass in the important regions of the integrand, but unstable weights can make the variance explode.
  108. Step 108
    ICA separates a linear mixture \( x = A s \) into statistically independent non-Gaussian sources by finding a de-mixing matrix \( W \) that maximises the non-Gaussianity of \( y = W x \). Classical application: the cocktail-party problem. Key distinction from PCA: maximises independence, not variance.
  109. Step 109
    InfoNCE maximises a mutual-information lower bound by classifying a positive pair against \( k \) negatives: \( \mathcal{L} = -\log \exp(s^+) / \sum_i \exp(s_i) \). NT-Xent is InfoNCE with temperature-scaled cosine similarities. Drives SimCLR, MoCo, CLIP, and most modern self-supervised representation learning.
  110. Step 110
    Two classical manifold-learning algorithms: Isomap replaces Euclidean distances with geodesic distances on a \( k \)-NN graph and applies MDS; LLE reconstructs each point from its neighbours' linear weights and finds a low-dimensional embedding that preserves those weights. Both set the conceptual stage for t-SNE and UMAP.
  111. Step 111
    Jensen’s inequality says that for a convex function f, applying f after taking an expectation gives a value no larger than taking the expectation after applying f. This one fact underlies many core results in ML, including the nonnegativity of KL divergence and the derivation of variational lower bounds.
  112. Step 112
    The Kalman filter recursively estimates the hidden state of a linear Gaussian dynamical system by alternating a prediction step with a measurement update. It is optimal for that model class because the posterior remains Gaussian and is fully described by a mean and covariance.
  113. Step 113
    Principal component analysis in the implicit feature space of a positive-definite kernel \( k(x, y) \). Eigendecomposes the centred Gram matrix \( K \) rather than the data covariance; recovers non-linear principal directions without ever instantiating the feature map. Used for non-linear dimensionality reduction and feature extraction.
  114. Step 114
    The Karush–Kuhn–Tucker (KKT) conditions generalise \( \nabla f = 0 \) to constrained optimisation, giving necessary (and, for convex problems, sufficient) conditions for a minimum. Lagrangian duality transforms the constrained primal into a dual over multipliers; for convex problems with Slater's condition, the duality gap closes — the workhorse derivation behind SVMs, interior-point methods, and much of RL.
  115. Step 115
    KL divergence measures how one probability distribution differs from a reference distribution through an expected log-ratio. It is nonnegative and asymmetric, so it is best understood not as a distance but as the penalty for modeling samples from one distribution with another.
  116. Step 116
    Knowledge distillation trains a smaller student model to match the outputs or internal representations of a larger teacher. It transfers some of the teacher’s behavior into a cheaper model, often using soft targets that contain more information than hard labels alone.
  117. Step 117
    Lagrange multipliers solve constrained optimization problems by introducing auxiliary variables that encode the constraints inside a single objective. At a constrained optimum, the gradient of the objective lies in the span of the constraint gradients, which is why the method is central to duality and SVM derivations.
  118. Step 118
    A likelihood ratio test compares how well two nested statistical models explain the same data by taking the ratio of their maximized likelihoods. Large likelihood-ratio statistics indicate that the larger model fits substantially better than the restricted one, and under regularity conditions the test statistic is asymptotically chi-squared.
  119. Step 119
    LDA and QDA are generative classifiers: model class-conditionals \( p(x \mid y = k) = \mathcal{N}(\mu_k, \Sigma_k) \) and apply Bayes' rule. LDA assumes shared covariance \( \Sigma \) across classes, giving linear decision boundaries and shrinkage-like regularisation; QDA uses per-class \( \Sigma_k \), giving quadratic boundaries at the cost of \( K \times d^2 \) parameters. Both are the generative counterparts of logistic regression.
  120. Step 120
    A LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
  121. Step 121
    LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
  122. Step 122
    The metric \( d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})} \) measures distance in a space whitened by the covariance \( \Sigma \). It is the natural distance for Gaussian data, the log-likelihood core of a multivariate Gaussian, and the basis of Fisher's discriminant analysis.
  123. Step 123
    Markov Chain Monte Carlo samples from a difficult target distribution by constructing a Markov chain whose stationary distribution matches that target. It is essential in Bayesian inference because it replaces intractable posterior integrals with averages over samples, provided the chain mixes well enough.
  124. Step 124
    Matrix calculus expresses derivatives of scalar, vector, and matrix-valued functions of matrix inputs in compact form, avoiding entrywise index chasing. Working with the differential \( dL = \text{tr}(G^\top dX) \) identifies \( \nabla_X L = G \) directly and turns backpropagation into linear algebra.
  125. Step 125
    Matrix factorisation decomposes a sparse user–item rating matrix \( R \approx UV^\top \) into low-rank user and item embeddings, learned by minimising squared error on observed entries with regularisation. This is the canonical collaborative-filtering recipe behind Netflix-style recommenders, the spiritual ancestor of modern embedding-based retrieval, and a clean worked example of low-rank learning.
  126. Step 126
    The maximum likelihood estimate selects the parameter values that make the observed data most probable under the model. Many standard ML objectives, including cross-entropy for classification and next-token prediction for LLMs, are just MLE written as minimization of negative log-likelihood.
  127. Step 127
    Memory optimization in ML training is the collection of techniques that reduce peak memory so larger models or batches fit on available hardware. Common examples are mixed precision, activation checkpointing, optimizer sharding, offloading, and more memory-efficient attention kernels.
  128. Step 128
    The canonical MCMC recipe: propose \( \theta' \sim q(\theta' \mid \theta) \), accept with probability \( \min(1, \pi(\theta')q(\theta\mid\theta')/[\pi(\theta)q(\theta'\mid\theta)]) \). Produces a Markov chain with stationary distribution \( \pi \) for any valid proposal, turning intractable posterior sampling into a correctness-guaranteed iterative procedure.
  129. Step 129
    Missing-data methods try to preserve inference when some values are unobserved by modeling why data are missing and how to fill or integrate over the missing entries. The key distinction is between MCAR, MAR, and MNAR, because imputation is far safer when missingness can be treated as conditionally ignorable.
  130. Step 130
    Model merging combines the weights or weight deltas of separately fine-tuned models into one model without full retraining. It can create multitask behavior cheaply, but naive averaging often causes interference unless the merged models share a common base and compatible parameter geometry.
  131. Step 131
    Model parallelism splits one model across multiple devices because it is too large or compute-heavy for a single device. The split can happen by layers, tensors, experts, or sequence chunks, trading memory savings for extra communication.
  132. Step 132
    Model soups average the weights of multiple fine-tuned models that lie in the same low-loss basin, often improving accuracy and robustness without extra inference cost. Unlike an ensemble, a soup is still a single model at serving time.
  133. Step 133
    For any \( A \in \mathbb{R}^{m \times n} \) with SVD \( A = U\Sigma V^\top \), the Moore–Penrose pseudoinverse is \( A^+ = V\Sigma^+ U^\top \), where \( \Sigma^+ \) inverts the non-zero singular values and transposes. \( A^+ b \) is the minimum-norm least-squares solution to \( Ax = b \) — the canonical generalisation of \( A^{-1} \) for rectangular and rank-deficient matrices.
  134. Step 134
    Multiple hypothesis testing asks how to control false positives when many tests are run at once. False discovery rate control, especially the Benjamini–Hochberg procedure, limits the expected fraction of rejected hypotheses that are actually null and is usually less conservative than family-wise error control.
  135. Step 135
    The multivariate Gaussian is the vector-valued generalization of the normal distribution, parameterized by a mean vector and covariance matrix. It is foundational because linear transformations, marginals, and conditionals all stay Gaussian, making analysis and inference unusually tractable.
  136. Step 136
    Conditional entropy \( H(Y \mid X) \) measures the residual uncertainty in \( Y \) given \( X \). Mutual information \( I(X; Y) = H(Y) - H(Y \mid X) \) measures how much \( X \) reduces the uncertainty in \( Y \) — equivalently, the KL divergence from the joint \( p(X,Y) \) to the product of marginals \( p(X)p(Y) \). Together they quantify dependence, drive information-bottleneck theory, and define decision-tree splits.
  137. Step 137
    Negative log-likelihood is the loss obtained by negating the log-likelihood, so maximizing probability becomes minimizing a positive objective. It is the standard training loss for probabilistic classifiers, language models, and many generative models.
  138. Step 138
    Newton's method uses the second-order model \( f(x+\Delta) \approx f(x) + g^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta \) to take the step \( \Delta = -H^{-1} g \). It converges quadratically near a minimum but is impractical in high dimensions. Quasi-Newton methods (BFGS, L-BFGS) approximate \( H^{-1} \) from gradient histories, retaining super-linear convergence at the cost of a Hessian solve per step.
  139. Step 139
    Learn an unnormalised model \( \tilde p_\theta(x) \) by training a binary classifier to distinguish data samples from noise samples. The classifier's logit becomes \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \), so the partition function is absorbed into a learnable constant. Foundation of word2vec's negative sampling and of InfoNCE contrastive learning.
  140. Step 140
    Factorise a nonnegative matrix \( V \approx W H \) with \( W, H \ge 0 \) entrywise. The nonnegativity constraint yields parts-based, interpretable components (topic–word, basis–image) and distinguishes NMF from PCA, whose sign-free components are typically holistic and hard to name.
  141. Step 141
    R-CNN ran a CNN classifier on externally-proposed regions; Fast R-CNN shared backbone features across proposals; Faster R-CNN introduced a learned Region Proposal Network; DETR replaced the entire region-proposal pipeline with a transformer that predicts a fixed set of boxes via bipartite matching.
  142. Step 142
    A one-class SVM learns a boundary around mostly normal data by separating mapped training points from the origin with maximum margin in feature space. It is a classic novelty-detection method because it needs examples of the inlier class but not labeled anomalies.
  143. Step 143
    Two non-parametric procedures for deciding whether model A beats model B with statistical significance. Paired bootstrap resamples matched predictions to estimate a confidence interval on the metric difference. McNemar's test uses a chi-squared on the \( 2\times 2 \) contingency table of agreement / disagreement.
  144. Step 144
    Platt scaling calibrates a classifier by fitting a one-dimensional logistic regression from raw scores to probabilities on a validation set. Isotonic regression is a more flexible monotonic alternative that can fit non-sigmoid calibration curves, but it usually needs more calibration data to avoid overfitting.
  145. Step 145
    PointNet processes an unordered point set by applying a shared MLP to each point, then pooling across points with a symmetric function (max-pool). Permutation-invariant by construction; PointNet++ adds local-region hierarchies to capture geometric structure.
  146. Step 146
    Poisson \( (\lambda) \) models counts of independent events in a fixed interval; Exponential \( (\lambda) \) models the waiting time between them. They are the two marginals of the Poisson process — one discrete, one continuous — and between them cover rare-event counts, waiting times, and the building blocks of survival analysis, rate modelling, and queueing theory.
  147. Step 147
    The potential outcomes framework defines causal effects by comparing the outcomes a unit would have under different treatments. Because only one of those potential outcomes is observed for any given unit, causal inference is fundamentally about identifying missing counterfactuals under defensible assumptions.
  148. Step 148
    A precision-recall curve shows how precision and recall trade off as the decision threshold moves through a ranked list of predictions. Average precision summarizes that curve and is especially informative when the positive class is rare.
  149. Step 149
    A scoring rule is proper if a forecaster minimizes expected score by reporting their true predictive distribution. Proper scoring rules matter because they reward honest, calibrated probabilities rather than merely getting the top-ranked class right.
  150. Step 150
    Proximal gradient methods solve objectives of the form \(f(x)+g(x)\) where \(f\) is smooth and \(g\) is nonsmooth but has an easy proximal operator. Each step does gradient descent on \(f\) and then a shrinkage- or projection-like update for \(g\), which is why the method underlies Lasso, sparse coding, and constrained convex optimization.
  151. Step 151
    Pruning removes weights, neurons, heads, or entire blocks that contribute little to performance. It can reduce model size or compute, but aggressive pruning usually needs fine-tuning to recover lost accuracy.
  152. Step 152
    Any \( A \in \mathbb{R}^{m \times n} \) with \( m \ge n \) factors as \( A = QR \), with \( Q \in \mathbb{R}^{m \times n} \) having orthonormal columns and \( R \in \mathbb{R}^{n \times n} \) upper triangular. It is the numerically preferred route to OLS, avoids squaring the condition number, and is the work-horse of least-squares solvers in every major numerical library.
  153. Step 153
    Semantic segmentation assigns a class label to every pixel; instance segmentation further distinguishes object instances (two cats become two masks). Panoptic segmentation unifies them: one label per pixel with 'thing' vs 'stuff' classes. Backbones: FCN, DeepLab, Mask R-CNN, DETR/Mask2Former.
  154. Step 154
    Train a shared encoder so that semantically similar inputs map to nearby embeddings (contrastive / triplet loss) or that query-key scores reflect similarity directly. Used in face verification, signature matching, image retrieval; the conceptual parent of SimCLR, CLIP, and BiEncoder retrieval.
  155. Step 155
    SLERP interpolates between two parameter vectors along the great-circle path on a sphere instead of using straight-line interpolation. In model merging it can preserve vector norms and sometimes produce smoother blends than linear interpolation when the two directions differ strongly.
  156. Step 156
    Construct a similarity graph over data points, embed each point into \( \mathbb{R}^k \) using the \( k \) smallest eigenvectors of the graph Laplacian, and run k-means in that space. The Laplacian's spectrum encodes cluster structure as low-frequency eigenmodes — works for non-convex clusters where Euclidean k-means fails.
  157. Step 157
    Structural risk minimization extends empirical risk minimization by balancing training fit against model complexity. It is the learning-theoretic principle behind regularization, margin control, and choosing among hypothesis classes of different capacity.
  158. Step 158
    A sufficient statistic is a summary of the sample that retains all information about a parameter relevant for inference. This is why many classical models can replace an entire dataset with counts, sums, or means without changing the likelihood-based conclusions about the parameter.
  159. Step 159
    Surrogate losses replace hard-to-optimize 0–1 classification loss with tractable objectives such as logistic or hinge loss. A surrogate is classification-calibrated if optimizing it still drives the classifier toward the Bayes-optimal decision rule.
  160. Step 160
    t-SNE embeds high-dimensional points into 2D/3D by matching a heavy-tailed joint \( Q \) in the embedding space to a Gaussian-kernel joint \( P \) over neighbourhoods in the input space, minimising \( D_{\text{KL}}(P\|Q) \). The heavy Student-t tail in \( Q \) solves the crowding problem; perplexity tunes neighbourhood size. t-SNE preserves local structure well but distorts global distances.
  161. Step 161
    A task vector is the weight difference between a pre-trained model and the same model after fine-tuning on a task. Adding, subtracting, or scaling that vector can steer behavior, so task vectors provide a simple weight-space tool for editing or combining capabilities.
  162. Step 162
    Temperature scaling calibrates a classifier by dividing logits by a learned scalar temperature before the softmax. It often improves probability calibration on a validation set without changing the model’s ranking of classes.
  163. Step 163
    The Fisher Information Matrix measures how sensitive a model's log-likelihood is to changes in its parameters and therefore captures local statistical curvature. It underlies asymptotic variance bounds and natural gradient methods because it defines a geometry tied to the model's predictive distribution.
  164. Step 164
    TIES-Merging is a model-merging method that reduces interference by trimming small parameter changes, resolving sign conflicts, and merging only updates aligned with an agreed sign. It is designed for cases where separately fine-tuned models disagree on which weights should move and in what direction.
  165. Step 165
    Triplet margin loss trains an embedding space so an anchor is closer to a positive example than to a negative example by at least a fixed margin. It is a standard metric-learning objective because it directly enforces relative similarity rather than predicting class labels.
  166. Step 166
    UMAP constructs a fuzzy topological representation of the data using local Riemannian metrics, then finds a low-dimensional embedding whose fuzzy topology matches. Its cross-entropy objective has distinct attractive and repulsive terms — easier to scale than t-SNE — and a theoretical motivation grounded in category theory and Riemannian geometry. In practice, UMAP preserves more global structure than t-SNE at similar or better speed.
  167. Step 167
    The universal approximation theorem says that a sufficiently wide neural network with a suitable nonlinearity can approximate any continuous function on a compact domain arbitrarily well. It is an existence result, not a guarantee that training will find that approximation efficiently.
  168. Step 168
    A framework that turns Bayesian inference into optimisation: choose a tractable family \( q(\mathbf{z}; \phi) \) and maximise the evidence lower bound \( \mathcal{L}(\phi) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \), which simultaneously approximates the posterior and bounds the log-evidence. VAEs, variational Bayes, and amortised inference all descend from this objective.
  169. Step 169
    The vanishing gradient problem is the tendency for gradients propagated through many layers or time steps to shrink exponentially, making early parameters learn extremely slowly. It is especially severe in deep sigmoid or tanh networks and was a main motivation for LSTMs, better initialization, and residual connections.
  170. Step 170
    Three production gradient-boosted-decision-tree implementations with distinct tree-construction strategies: XGBoost does level-wise exact/approximate splits with second-order Taylor objective; LightGBM uses histogram-based leaf-wise growth and GOSS subsampling; CatBoost uses ordered boosting to avoid target leakage with categorical features.
  171. Step 171
    Single-stage detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from a single CNN forward pass. YOLOv1 was real-time but coarse; YOLOv3–v10 progressively adopted anchor boxes, FPN, CSP blocks, decoupled heads, and finally anchor-free / NMS-free designs for edge deployment.
  172. Step 172
    Friston's free-energy principle frames perception, learning, and action as minimising a single quantity — variational free energy, a KL-plus-accuracy bound on surprise. Active inference extends the principle to behaviour: agents select actions that minimise expected free energy, simultaneously seeking reward and information. It is a generative-model-based alternative to classical RL with tight links to variational inference, ELBO, and exploration–exploitation.
  173. Step 173
    Adversarial robustness studies how learned models can be forced into wrong predictions by carefully chosen perturbations that are small, structured, or semantically deceptive. Modern attack families include gradient-based \(\ell_p\) attacks, universal perturbations, adversarial patches, and transfer attacks; defenses must avoid merely hiding gradients while leaving the model fragile.
  174. Step 174
    A message-passing algorithm that computes exact marginals on tree-structured graphical models in linear time and approximate marginals (loopy BP) on general graphs. Each node sends summarising messages to neighbours; final beliefs equal the product of incoming messages.
  175. Step 175
    Canonical correlation analysis finds linear combinations of two random vectors that are maximally correlated with each other. It is the right tool when the question is about shared structure between two views of the same examples rather than variance within a single view.
  176. Step 176
    High-probability bounds on how far a sum or function of independent random variables can deviate from its mean. Hoeffding uses boundedness, Bernstein exploits known variance for a tighter bound, and McDiarmid handles functions whose value changes little when any single argument changes — the workhorses behind PAC / generalization proofs.
  177. Step 177
    A distribution-free procedure for turning any point predictor into a prediction set with guaranteed finite-sample coverage \( 1 - \alpha \) under exchangeability. Requires only a scoring function and a calibration set; no assumption on the underlying model or data distribution.
  178. Step 178
    A hybrid CNN-plus-attention block for speech recognition: each Conformer layer combines multi-head self-attention (global), depthwise convolutions (local), and sandwiched feed-forward modules. Outperforms pure Transformer and pure CNN on LibriSpeech and became the de facto encoder for production ASR.
  179. Step 179
    The Cramér–Rao bound states that any unbiased estimator \( \hat\theta \) of a parameter \( \theta \) has variance \( \text{Var}(\hat\theta) \ge 1/I(\theta) \), where \( I(\theta) \) is the Fisher information. It is the foundational efficiency bound of classical statistics and gives an immediate lower bound on the uncertainty of MLE-based procedures.
  180. Step 180
    Two objectives for training sequence-to-sequence models when alignment between input and output frames is unknown. CTC sums over all alignment paths with blank symbols; RNN-T decomposes into a prediction network and joint network to model output-length independently. Backbones of modern ASR pipelines.
  181. Step 181
    Formal guarantee that an algorithm's output distribution barely changes if any single training example is replaced. DP-SGD achieves \( (\varepsilon, \delta) \)-DP by clipping per-example gradients and adding calibrated Gaussian noise. Central to privacy-preserving ML training on sensitive data.
  182. Step 182
    Do-calculus is Pearl's set of graphical transformation rules for turning interventional quantities into observational quantities when the causal graph permits it. It matters because it separates what can be identified from data plus structure from what remains fundamentally unidentifiable.
  183. Step 183
    Double descent is the phenomenon in which test error first follows the classical U-shape with increasing model size, then improves again once the model passes the interpolation threshold. It matters because it shows that the old bias-variance story is incomplete in highly overparameterized regimes.
  184. Step 184
    Two derivations of the ELBO: (i) Jensen's inequality applied to \( \log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z} \), and (ii) the identity \( \log p(\mathbf{x}) = \text{ELBO}(q) + D_{\text{KL}}(q \| p(\cdot \mid \mathbf{x})) \). Both produce the same bound, but the second makes the gap explicit.
  185. Step 185
    For any convex \( f \) with \( f(1) = 0 \), the \( f \)-divergence \( D_f(P \| Q) = \mathbb{E}_Q[f(dP/dQ)] \) recovers KL (\( f = t \log t \)), reverse KL, Jensen–Shannon, total variation, \( \chi^2 \), Hellinger, and α-divergences as special cases. The variational (Fenchel) form underlies f-GAN and density-ratio estimation.
  186. Step 186
    Fully Sharded Data Parallel shards model parameters, gradients, and optimizer states across data-parallel workers, gathering full parameters only when needed for computation. It is the PyTorch analogue of ZeRO-style training and makes much larger models fit without custom model-parallel code.
  187. Step 187
    A Gaussian process defines a distribution over functions such that any finite set of evaluations is jointly Gaussian. Given a kernel \( k \) and noisy observations, the posterior at a test point is \( \mathcal{N}(\mu_*, \sigma_*^2) \) with closed-form mean and variance. GP regression gives both predictions and calibrated uncertainty, costs \( O(n^3) \) for \( n \) training points, and is the Bayesian counterpart of kernel ridge regression.
  188. Step 188
    HMC augments the target distribution with an auxiliary momentum variable and simulates Hamiltonian dynamics so that proposals move long distances while staying on near-constant-energy shells. It explores complex posteriors dramatically faster than random-walk Metropolis; the No-U-Turn Sampler (NUTS) adapts the integration length automatically.
  189. Step 189
    Instrumental variables identify causal effects when treatment is confounded, provided an instrument affects treatment, is as-good-as random with respect to unobserved confounders, and influences the outcome only through treatment. In simple linear settings, the IV estimand is the ratio of the instrument-outcome covariance to the instrument-treatment covariance.
  190. Step 190
    Langevin MCMC treats gradient noise as deliberate: proposals drift along \( -\nabla \log \pi(\theta) \) plus Gaussian perturbation so the chain targets \( \pi \). Metropolis-adjusted Langevin (MALA) corrects discretisation bias with an MH acceptance; unadjusted Langevin (ULA) trades bias for simplicity and scales to big data via stochastic gradients (SGLD).
  191. Step 191
    Mercer's theorem shows that a continuous positive-semidefinite kernel can be expanded in nonnegative eigenvalues and orthogonal eigenfunctions. This makes kernel functions equivalent to inner products in a possibly infinite-dimensional feature space and motivates the kernel trick.
  192. Step 192
    MAML learns an initialisation \( \theta \) such that one or a few SGD steps on a new task yield good performance — formally \( \min_\theta \sum_\tau \mathcal{L}_\tau(\theta - \alpha \nabla \mathcal{L}_\tau(\theta)) \). Reptile is a first-order simplification that moves \( \theta \) toward per-task adapted parameters. Influential early 'learn to learn' recipes, since absorbed into prompt-based few-shot learning.
  193. Step 193
    The natural gradient \( \tilde\nabla_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L} \) preconditions the Euclidean gradient by the inverse Fisher information matrix, yielding steepest descent under the KL-divergence metric on the statistical manifold. It underlies K-FAC, Shampoo, TRPO's trust region, and the original motivation for reparameterisation-invariant optimisation.
  194. Step 194
    The Neural Tangent Kernel is the kernel that describes how an infinitely wide network trained by small gradient steps evolves around its initialization. In that limit, training becomes equivalent to kernel regression, which explains part of the behavior of very wide networks.
  195. Step 195
    The No-Free-Lunch theorem says that averaged uniformly over all possible target functions, no learning algorithm outperforms any other. In machine learning it means performance gains must come from inductive bias that matches the structure of the problems we actually care about.
  196. Step 196
    PAC-Bayes bounds the generalization gap of a stochastic classifier \( Q \) by a Kullback–Leibler term against a data-independent prior \( P \): with probability \( 1 - \delta \), \( \mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat R(h)] + O(\sqrt{\text{KL}(Q\|P)/n}) \). Used for non-vacuous bounds on overparameterised networks.
  197. Step 197
    A particle filter approximates the posterior over a hidden state with weighted samples, or particles, instead of a single Gaussian. It is useful for nonlinear or non-Gaussian state-space models, but resampling and weight degeneracy are central practical issues.
  198. Step 198
    Pipeline parallelism partitions a model by layers across devices and sends microbatches through the partitions like an assembly line. It reduces per-device memory, but pipeline bubbles and stage imbalance can waste throughput if the schedule is poorly tuned.
  199. Step 199
    QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
  200. Step 200
    The empirical Rademacher complexity of a function class \( \mathcal{F} \) on data \( S = (z_1, \dots, z_n) \) is \( \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma[\sup_{f \in \mathcal{F}} \tfrac{1}{n}\sum_i \sigma_i f(z_i)] \) — the expected ability of \( \mathcal{F} \) to correlate with random \( \pm 1 \) signs. It is the data-dependent workhorse of modern generalisation bounds, usually tighter than VC, and gives direct norm-based bounds for deep networks.
  201. Step 201
    An RKHS is a Hilbert space of functions where evaluation at a point can be written as an inner product with a kernel function. The representer theorem says that many regularized empirical-risk problems in an RKHS have solutions that are finite sums of kernel evaluations at the training points, making kernel methods practical.
  202. Step 202
    Deep Sets: any permutation-invariant function on sets equals \( \rho(\sum_i \phi(x_i)) \) for learnable \( \phi, \rho \). Set Transformer replaces the sum with self-attention via Induced Set Attention Blocks, giving element-wise interactions while remaining permutation-equivariant.
  203. Step 203
    Minimise a loss whose value is worst-case over a small \( \rho \)-ball of weight perturbations: \( \min_\theta \max_{\|\varepsilon\| \le \rho} \mathcal{L}(\theta + \varepsilon) \). The ascent step \( \varepsilon^\star \approx \rho \, \nabla \mathcal{L}/\|\nabla \mathcal{L}\| \) biases training toward flat minima, improving generalisation across ViT, ResNet, and LLM finetuning.
  204. Step 204
    Uncertainty estimation in deep learning tries to quantify when a model should be unsure, not just what label it predicts. The key distinction is between aleatoric uncertainty from irreducible noise in the data and epistemic uncertainty from limited knowledge of the model, and modern methods such as deep ensembles, MC dropout, and conformal prediction target those uncertainties differently.
  205. Step 205
    The Vapnik–Chervonenkis dimension of a hypothesis class \( \mathcal{H} \) is the largest number of points \( \mathcal{H} \) can shatter — label in every possible way. It controls PAC generalisation: if \( \text{VC}(\mathcal{H}) = d \), then with probability \( 1 - \delta \) all \( h \in \mathcal{H} \) satisfy \( L(h) \le \hat L(h) + O(\sqrt{d/n}) \). VC dimension explains why linear classifiers in \( \mathbb{R}^d \) need \( \Omega(d) \) examples and why simple hypothesis classes generalise.
  206. Step 206
    The \( p \)-Wasserstein distance \( W_p(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \big( \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|^p \big)^{1/p} \) measures the minimum cost of reshaping distribution \( \mu \) into \( \nu \). It underpins WGAN, flow matching, and a whole family of divergences that remain well-behaved when KL blows up.