Tag: foundational
233 topic(s)
- GPT-3 & Few-Shot In-Context LearningGPT-3 showed that a 175B-parameter autoregressive Transformer can perform many tasks from natural-language instructions and a few demonstrations in the prompt, without gradient updates or task-specific fine-tuning. That result made in-context learning a central paradigm and showed that scale alone could unlock strong few-shot behavior.
- The Bitter LessonThe Bitter Lesson is Sutton's argument that, over the long run, general methods that scale with compute and data outperform systems built around hand-crafted domain knowledge. It is a historical pattern claim, not a theorem, and its force comes from repeated examples in search, game playing, vision, and language.
- GloVe Word EmbeddingsGloVe learns word embeddings by fitting vector dot products to the log of global word-word co-occurrence counts. Because its objective is derived from ratios of co-occurrence probabilities, linear relations such as king minus man plus woman approximately equals queen often emerge in the embedding space.
- Xavier/Glorot InitializationXavier or Glorot initialization chooses weight variance from fan-in and fan-out so activations and gradients stay roughly stable across deep layers. It is well suited to symmetric activations such as tanh, while ReLU networks usually prefer He initialization.
- ImageNet DatasetImageNet is a large, hierarchically labeled image dataset whose 1000-class ILSVRC benchmark became the defining testbed for modern computer vision. AlexNet's 2012 win on ImageNet triggered the deep learning shift by showing that GPU-trained CNNs could dramatically beat hand-engineered pipelines.
- Neural Probabilistic Language ModelThe Neural Probabilistic Language Model replaced count-based n-grams with learned word embeddings and a neural network that predicts the next word from a continuous representation of context. Its core contribution was showing that distributed representations let language models generalize to unseen but similar word sequences.
- Next-Token Prediction Objective (Causal Language Modeling)Next-token prediction trains a causal language model to assign high probability to each token given all previous tokens. Maximizing this likelihood over large text corpora teaches the model syntax, facts, and reusable patterns that later support prompting and generation.
- Byte Pair Encoding (BPE)Byte Pair Encoding is a subword tokenization method that repeatedly merges the most frequent adjacent symbols in a corpus. It builds a vocabulary between characters and whole words, which handles rare words better than word-level tokenization while keeping sequence lengths manageable.
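A toy sketch of the BPE merge loop, assuming a tiny corpus of whitespace-pre-split words with made-up frequencies:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Rewrite each word with the chosen pair fused into a single symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Toy corpus: words pre-split into characters, with made-up frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):
    pairs = pair_counts(corpus)
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair becomes a merge rule
    corpus = merge_pair(best, corpus)
    print("merged:", best)
print(corpus)
```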
- Softmax TemperatureSoftmax temperature rescales logits before softmax to control randomness in the output distribution. Lower temperature makes probabilities sharper and decoding more deterministic, while higher temperature flattens the distribution and increases diversity.
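A minimal NumPy sketch of the temperature scaling described above; the logit values are made up for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Rescale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=0.5))  # sharper distribution
print(softmax_with_temperature(logits, temperature=2.0))  # flatter distribution
```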
- Sinusoidal Positional EncodingSinusoidal positional encoding adds fixed sine and cosine patterns of different frequencies to token embeddings so the model can infer token order. The encoding is deterministic and smooth across positions, which let the original Transformer represent position without learning a separate table.
- Causal (Masked) Self-AttentionCausal masked self-attention is self-attention with a mask that prevents each position from attending to future tokens. Applying the mask before softmax enforces autoregressive order, so the model can predict the next token without seeing the answer in advance.
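A minimal single-head, unbatched sketch of causal scaled dot-product attention in NumPy; the matrices are random placeholders rather than learned projections:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask (no batching)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) query-key similarities
    mask = np.triu(np.ones_like(scores), k=1)         # 1s above the diagonal mark future positions
    scores = np.where(mask == 1, -np.inf, scores)     # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the allowed positions
    return weights @ V                                # weighted sum of value vectors

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(causal_self_attention(Q, K, V).shape)           # (4, 8)
```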
- Lagrange MultipliersLagrange multipliers solve constrained optimization problems by introducing auxiliary variables that encode the constraints inside a single objective. At a constrained optimum, the gradient of the objective lies in the span of the constraint gradients, which is why the method is central to duality and SVM derivations.
- Ordinary Least Squares (OLS) Closed-Form SolutionThe OLS closed-form solution is the exact least-squares answer computed directly from the design matrix rather than by iterative optimization. In the full-rank case it solves the normal equations, and geometrically it projects the target vector onto the column space of the features.
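A small NumPy illustration of the closed-form fit; the synthetic data and coefficients are made up, and `np.linalg.lstsq` is used as a numerically safer stand-in for inverting the normal equations:

```python
import numpy as np

# Toy data: y is roughly 3*x0 - 2*x1 + 1 plus noise (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + 0.1 * rng.normal(size=100)

X_design = np.hstack([X, np.ones((100, 1))])   # append a column of ones for the bias term
# Normal equations: w = (X^T X)^{-1} X^T y; lstsq solves the same least-squares problem stably.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(w)                                       # approximately [3, -2, 1]
```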
- PAC Learning (Probably Approximately Correct)PAC learning formalizes what it means for a hypothesis class to be learnable: with enough samples, an algorithm should return a hypothesis whose error is small with high probability. It is foundational because sample complexity and model capacity can then be expressed as rigorous guarantees instead of heuristics.
- Receiver Operating Characteristic (ROC) & AUCThe ROC curve plots true positive rate against false positive rate as a binary classifier's threshold changes, and AUC summarizes that curve into a single number. AUC also has a ranking interpretation: it is the probability that a random positive example scores above a random negative one.
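A short sketch of the ranking interpretation of AUC, with made-up scores; it counts the fraction of positive-negative pairs ranked correctly, with ties counted as half:

```python
import numpy as np

def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the probability a random positive scores above a random negative."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    wins = (pos > neg).mean()
    ties = (pos == neg).mean()
    return wins + 0.5 * ties

print(auc_by_ranking([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))  # 8 of 9 pairs ranked correctly ≈ 0.889
```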
- Convex FunctionA convex function is one whose graph over any line segment lies on or below the chord joining its values at the segment's endpoints. This matters in optimization because convex problems have no spurious local minima: every local minimum is global.
- Shannon EntropyShannon entropy measures the expected surprisal of a random variable and quantifies how uncertain its outcomes are. It is the basic information-theoretic quantity from which cross-entropy, KL divergence, mutual information, and many ML loss functions are built.
- L1 vs. L2 NormsThe L1 norm sums absolute values and tends to promote sparsity when used as a penalty, while the L2 norm measures Euclidean length and tends to shrink weights smoothly without zeroing many of them out. That difference is why L1 is associated with feature selection and L2 with stable shrinkage.
- K-Means Objective Function (Inertia)The K-means objective, also called inertia, is the sum of squared distances from each point to its assigned cluster centroid. K-means greedily lowers that objective by alternating between reassigning points and recomputing centroids, though the result still depends on initialization because the problem is nonconvex.
- The Curse of DimensionalityThe curse of dimensionality is the collection of high-dimensional effects that make data sparse, neighborhoods less informative, and sample requirements explode as dimension grows. It helps explain why distance-based methods, density estimation, and exhaustive search often break down in large feature spaces.
- The Bellman EquationThe Bellman equation recursively expresses the value of a state or state-action pair as immediate reward plus discounted expected future value. It is the backbone of dynamic programming and reinforcement learning because it turns long-horizon return into a local consistency condition.
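A toy value-iteration sketch that applies the Bellman optimality backup until the values settle; the two-state MDP's transitions and rewards are invented for illustration:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (made up): P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * (P @ V)
    V = Q.max(axis=1)
print(V)   # converged state values
```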
- The Markov PropertyThe Markov property says that the conditional distribution of the future depends only on the present state, not on the full past history, once the state is known. It is the defining assumption behind Markov chains and MDPs and tells you when a state representation is sufficient for planning.
- The Fisher Information MatrixThe Fisher Information Matrix measures how sensitive a model's log-likelihood is to changes in its parameters and therefore captures local statistical curvature. It underlies asymptotic variance bounds and natural gradient methods because it defines a geometry tied to the model's predictive distribution.
- Information Gain (Decision Trees)Information gain is the reduction in entropy achieved by splitting a dataset on a candidate feature. Decision-tree algorithms use it to choose splits that most reduce label uncertainty, though raw information gain can be biased toward features with many distinct values.
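A small sketch of computing information gain for one candidate categorical split; the toy labels and feature values are made up:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the children after the split."""
    total = len(labels)
    child_entropy = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        child_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - child_entropy

# A feature that perfectly separates the labels recovers the full parent entropy (1 bit here).
labels  = [1, 1, 0, 0]
feature = ["a", "a", "b", "b"]
print(information_gain(labels, feature))   # 1.0
```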
- DropoutDropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer.
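A minimal sketch of inverted dropout, the common variant in which surviving activations are rescaled during training so no rescaling is needed at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero units with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                                 # the full network is used at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)                  # rescaling keeps the expected activation unchanged

x = np.ones(8)
print(dropout(x, p=0.5, rng=np.random.default_rng(0)))
```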
- Singular Value Decomposition (SVD)Singular Value Decomposition factors any matrix into orthogonal directions and nonnegative singular values. It is fundamental because low-rank approximation, PCA, pseudoinverses, compression, and many denoising methods all follow from that decomposition.
- Maximum A Posteriori (MAP) EstimationMaximum a posteriori estimation chooses the parameter value that maximizes posterior probability given the data. It is equivalent to maximum likelihood plus a log-prior regularizer, which is why MAP connects Bayesian estimation to familiar penalized optimization objectives.
- Supervised learningSupervised learning trains a model on labeled input-output pairs so it can predict the correct target on new examples from the same distribution. Classification and regression are its two main forms, depending on whether the target is discrete or continuous.
- Unsupervised learningUnsupervised learning tries to discover structure in data without labeled targets, such as clusters, latent factors, or a density model. It is used for representation learning, dimensionality reduction, clustering, and generative modeling when explicit supervision is unavailable.
- Reinforcement learning (RL)Reinforcement learning studies how an agent should act through trial and error to maximize cumulative reward in an environment. Unlike supervised learning, feedback is delayed and depends on the agent's own actions, so the problem is about sequential decision-making as much as prediction.
- Linear functionA linear function satisfies additivity and homogeneity, so it can be written as a matrix map with no bias term. In machine learning people often use 'linear' loosely for affine maps, but mathematically the distinction matters because adding a bias breaks true linearity.
- Affine transformationAn affine transformation is a linear map followed by a translation, so it has weights and a bias. Dense neural network layers are affine rather than strictly linear, because the bias lets the model shift activations and decision boundaries.
- Loss functionA loss function maps a model's prediction and the true target to a scalar error signal that training aims to minimize. It defines what the model is optimized for, so changing the loss changes which mistakes are treated as costly.
- Mean squared error (MSE)Mean squared error averages the squared difference between predicted and true values, making large errors count disproportionately more than small ones. For regression it is especially important because minimizing MSE is equivalent to maximum likelihood under Gaussian noise.
- Prediction errorPrediction error is the difference between a model's prediction and the true target for an example. It is the atomic quantity from which losses, residual analysis, and generalization metrics are built.
- Gradient descentGradient descent minimizes a differentiable objective by repeatedly moving parameters in the direction of steepest local decrease, namely the negative gradient. Its step size is set by the learning rate, so convergence depends on both objective geometry and update scale.
- Mini-batch gradient descentMini-batch gradient descent estimates the gradient on a small subset of training examples at each update instead of on the full dataset or a single example. It is the practical default in deep learning because it balances hardware efficiency with optimization noise.
- Stochastic gradient descent (SGD)Stochastic gradient descent updates parameters using a gradient estimate from one example or a very small random batch, making each step noisy but cheap. That noise can slow exact convergence yet often helps large models optimize and generalize in practice.
- ConvergenceIn optimization, convergence means an algorithm's iterates approach a stable solution or stationary point as updates continue. In practice people often mean that the loss or parameters stop changing much, though true convergence depends on the objective and algorithmic assumptions.
- GeneralizationGeneralization is a model's ability to perform well on unseen data from the same underlying distribution as its training data. It is the real goal of learning, because low training error alone can come from memorization rather than useful structure.
- RegularizationRegularization is any technique that biases learning toward simpler, more stable, or less overfit solutions. It can appear as an explicit penalty such as weight decay or as an implicit training choice such as data augmentation, dropout, or early stopping.
- L1 regularization (Lasso)L1 regularization adds a penalty proportional to the sum of absolute parameter values, encouraging many coefficients to become exactly zero. That sparsity makes Lasso useful when feature selection is part of the goal, not just shrinkage.
- L2 regularization (Ridge/Weight Decay)L2 regularization adds a penalty proportional to the sum of squared parameter values, shrinking weights toward zero but usually not making them exactly zero. In plain SGD it is equivalent to weight decay and is widely used because it improves stability and reduces variance.
- Early stoppingEarly stopping regularizes training by halting optimization when validation performance stops improving and keeping the best checkpoint seen so far. It works because prolonged optimization can eventually fit noise or idiosyncrasies of the training set rather than signal.
- HyperparameterA hyperparameter is a setting chosen outside the optimization loop, such as learning rate, model width, regularization strength, or batch size. Unlike learned parameters, hyperparameters govern how the model is trained or structured and are usually selected by validation.
- Data leakageData leakage occurs when information that would not be available at prediction time leaks into training or model selection, causing overly optimistic evaluation. Common examples are fitting preprocessing on the full dataset, peeking at test labels, or using future information in time-series tasks.
- BackpropagationBackpropagation computes gradients of a scalar loss with respect to all network parameters by applying the chain rule backward through the computation graph. It makes deep learning practical because it turns a complicated nested function into reusable local gradient calculations.
- Forward passThe forward pass is the computation that maps input data through the model to produce activations and an output prediction. During training it also caches intermediate values needed later by the backward pass.
- Backward passThe backward pass propagates gradients from the loss back through the computation graph to determine how each parameter affected the final error. It uses stored forward-pass intermediates and the chain rule to accumulate derivatives efficiently.
- Chain ruleThe chain rule gives the derivative of a composition of functions by multiplying local derivatives along the computation path. It is the mathematical principle that backpropagation applies at scale throughout a neural network.
- Automatic differentiationAutomatic differentiation computes exact derivatives of a program by systematically composing derivatives of its primitive operations. Unlike symbolic differentiation it does not manipulate formulas, and unlike numerical differentiation it does not rely on finite-difference approximations.
- Computational graphA computational graph represents a calculation as nodes for variables or operations and edges for data dependencies. It is useful because the same graph that defines the forward computation can also be traversed backward to perform automatic differentiation.
- Neural networkA neural network is a parameterized function built by composing affine transformations with nonlinear activations across layers. Its power comes from learning representations from data rather than relying on hand-crafted features for each task.
- Learning rateThe learning rate is the scalar that sets how large each optimization step is when parameters are updated. If it is too high training can diverge or oscillate, and if it is too low training can become extremely slow or get stuck.
- EpochAn epoch is one complete pass through the training dataset. In mini-batch training, an epoch consists of many updates, one for each batch needed to cover the data once.
- OverfittingOverfitting happens when a model fits patterns specific to the training set, including noise, better than it captures the underlying data-generating structure. The usual symptom is low training error paired with substantially worse validation or test error.
- UnderfittingUnderfitting happens when a model is too limited, too constrained, or too poorly trained to capture the main structure in the data. It usually shows up as high error on both training and validation data, indicating high bias rather than variance.
- NeuronA neuron in a neural network computes a weighted sum of its inputs, adds a bias, and applies an activation function. Collections of neurons form layers, so a single neuron's role is simple even though many together can represent complex functions.
- Activation functionAn activation function is the nonlinear mapping applied after an affine transformation in a neural network. It is what prevents a stack of layers from collapsing into one affine map, enabling deep networks to approximate complex functions.
- SigmoidThe sigmoid function maps a real number to a value between zero and one, making it easy to interpret as a probability or gate. Its downside is saturation at large positive or negative inputs, which can cause vanishing gradients in deep networks.
- TanhThe tanh function maps inputs to the range minus one to one and is zero-centered, which often makes optimization easier than with the sigmoid. Like the sigmoid, however, it still saturates at large magnitudes and can cause vanishing gradients.
- ReLUReLU outputs its input when it is positive and zero otherwise. It became the default activation in many deep networks because it is simple, cheap, and far less prone to saturation than sigmoid or tanh, though units can still die if they stay on the zero side.
- SoftmaxSoftmax turns a vector of logits into a probability distribution by exponentiating and normalizing them so the components sum to one. It is commonly used for multiclass prediction because it converts arbitrary scores into class probabilities while preserving their ranking.
- Fully connected layerA fully connected layer applies an affine transformation in which every output unit depends on every input feature. It is the standard dense layer used in multilayer perceptrons and as a projection block inside many larger architectures.
- Input layerThe input layer is the entry point of a network, where raw or preprocessed features are presented to the model. Unlike hidden layers, it usually performs little or no learned computation by itself and mainly defines the representation the rest of the network receives.
- Hidden layerA hidden layer is any internal layer between the input and output of a network. Hidden layers transform raw inputs into increasingly useful intermediate representations that the final output layer can read out.
- Output layerThe output layer is the final transformation that maps a model's last hidden representation to a prediction space such as class logits, probabilities, or regression values. Its shape and activation depend on the task being solved.
- Multilayer perceptron (MLP)A multilayer perceptron is a feedforward neural network made of stacked fully connected layers and nonlinear activations. It is the canonical dense architecture for tabular function approximation and the feed-forward subnetwork inside many Transformer blocks.
- Deep neural networkA deep neural network is a neural network with multiple hidden layers rather than just one or two. The extra depth lets it build hierarchical features and represent complex functions more efficiently than shallow networks in many settings.
- Deep learningDeep learning is the study and practice of training multilayer neural networks on data to learn useful representations automatically. Its hallmark is end-to-end learning of features and predictors together, especially when large data and compute are available.
- Composite functionA composite function applies one function to the output of another, such as f of g of x. Neural networks are composite functions at scale, which is why gradients are computed by repeatedly applying the chain rule.
- Feedforward neural networkA feedforward neural network is a network whose computations move from input to output without recurrent cycles. Each layer depends only on earlier activations in the same pass, making feedforward networks the basic template for MLPs and many vision models.
- Convolutional neural network (CNN)A convolutional neural network uses learned convolution filters with local receptive fields and weight sharing to process grid-like data such as images. Those inductive biases make CNNs especially effective and parameter-efficient for visual pattern recognition.
- Recurrent neural network (RNN)A recurrent neural network processes sequences by maintaining a hidden state that is updated one step at a time from the current input and previous state. This gives it a notion of temporal memory, but plain RNNs are hard to train on long dependencies because gradients can vanish or explode.
- Elman RNNAn Elman RNN is the classic simple recurrent network in which the next hidden state is a nonlinear function of the current input and previous hidden state. It introduced the basic hidden-state recurrence used by later gated models, but long-range memory is poor without gating.
- Hidden stateThe hidden state is the internal representation a sequential model carries forward as it processes inputs over time. In an RNN or LSTM it summarizes relevant past context, and in broader neural architectures it usually means a layer's intermediate activation vector.
- What is the vanishing gradient problem?The vanishing gradient problem is the tendency for gradients propagated through many layers or time steps to shrink exponentially, making early parameters learn extremely slowly. It is especially severe in deep sigmoid or tanh networks and was a main motivation for LSTMs, better initialization, and residual connections.
- Embedding layerAn embedding layer maps discrete IDs such as words, subwords, or items to learned dense vectors. It is essential whenever symbolic inputs must be represented in a continuous space that gradient-based models can manipulate.
- Embedding vectorAn embedding vector is the dense continuous representation assigned to a discrete token, item, or entity by an embedding table or model. Its meaning comes from geometry: similar entities tend to occupy nearby directions or neighborhoods in the learned space.
- Word embeddingA word embedding is a dense vector representation of a word learned from distributional context rather than hand-coded features. Its purpose is to place semantically or syntactically related words near one another in vector space so downstream models can generalize across vocabulary items.
- Semantic similaritySemantic similarity is the degree to which two words, sentences, or documents share meaning rather than just surface form. In machine learning it is often estimated with embeddings and cosine similarity, which turns meaning comparison into a geometric problem.
- Cosine similarityCosine similarity measures the angle between two vectors: \( \cos \theta = x \cdot y / (\|x\| \|y\|) \). It ignores magnitude and compares direction, which is why it is the default similarity metric for embeddings in retrieval, clustering, and semantic search.
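The formula above as a two-line function, applied to made-up vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))            # 0.0: orthogonal
```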
- Bag of wordsBag of words represents a document by counts or weights of vocabulary terms while discarding word order and syntax. It is simple, sparse, and historically central to information retrieval and document classification, but it cannot distinguish sentences with the same words in different orders.
- Document-term matrixA document-term matrix is a matrix whose rows are documents, columns are vocabulary terms, and entries are counts or weights such as TF-IDF. It is the core data structure behind bag-of-words retrieval, topic modeling, and many classical NLP pipelines.
- TF-IDFTF-IDF weights a term by how frequent it is in a document and how rare it is across the corpus, typically \( tf(w,d) \log(N/df(w)) \). It downweights ubiquitous words and highlights terms that are especially informative for a given document.
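A tiny sketch of the weighting above, assuming natural-log IDF and a made-up three-document corpus:

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "meowed"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency per term

def tfidf(term, doc):
    tf = doc.count(term)
    return tf * math.log(N / df[term])                    # tf(w,d) * log(N / df(w))

print(tfidf("cat", docs[0]))   # > 0: "cat" appears in only 2 of 3 documents
print(tfidf("the", docs[0]))   # 0: "the" appears in every document, so log(3/3) = 0
```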
- SparsitySparsity means most entries in a vector, matrix, or parameter set are exactly zero. In ML it matters because sparse representations save memory and computation, and because sparsity-inducing penalties such as L1 can make models more interpretable.
- Dense vectorA dense vector is a low- or moderate-dimensional representation in which most entries are nonzero. Dense vectors are usually learned embeddings, so they capture semantic similarity better than sparse count vectors but are harder to interpret directly.
- Sparse vectorA sparse vector has very few nonzero entries relative to its dimensionality. Classical text features such as bag-of-words and TF-IDF are sparse, which makes them memory-efficient and interpretable even when the feature space is huge.
- One-hot encodingOne-hot encoding represents a categorical variable as a binary vector with exactly one 1 and all other entries 0. It preserves category identity without implying any ordering, but its dimensionality grows linearly with the number of categories.
- TokenizationTokenization is the process of splitting raw text into model-readable tokens such as words, subwords, bytes, or characters. It determines vocabulary size, sequence length, and how efficiently a language model handles rare words, multilingual text, and code.
- TokenA token is the discrete unit a language model reads and predicts. Depending on the tokenizer, a token may be a word, subword, byte, punctuation mark, or special control symbol, and token count determines both context usage and API cost.
- SubwordA subword is a token unit smaller than a full word but larger than a character, learned to balance vocabulary size against sequence length. Subword tokenization lets models handle rare and novel words by composing them from reusable pieces.
- VocabularyA vocabulary is the fixed set of tokens a tokenizer can map text into and a model can natively process. Its size trades off compression against flexibility: larger vocabularies shorten sequences, while smaller ones rely more on subword or byte composition.
- CorpusA corpus is a structured collection of text used to train, fine-tune, or evaluate language models. Its size, quality, domain mix, and cleaning decisions strongly shape what a model knows, how it generalizes, and which biases it inherits.
- N-gramAn n-gram is a contiguous sequence of \( n \) tokens, such as a bigram for \( n=2 \) or trigram for \( n=3 \). N-grams are the basic units of classical language models and many text features because they capture short-range local context.
- Count-based language modelA count-based language model estimates sequence probabilities from n-gram counts in a corpus, then uses smoothing or backoff for unseen events. It was the dominant pre-neural approach to language modeling, but it struggles with long context and data sparsity.
- Language modelA language model assigns probabilities to token sequences, or equivalently predicts missing or next tokens from context. This unifies classical n-gram models, masked models like BERT, and autoregressive LLMs such as GPT under one probabilistic framework.
- PerplexityPerplexity is the exponentiated average negative log-likelihood of a test sequence, so lower perplexity means the model is less surprised by the data. It is a standard intrinsic metric for language models, though low perplexity does not guarantee downstream usefulness.
- Log-likelihoodLog-likelihood is the logarithm of the probability a model assigns to the observed data under given parameter values. Taking logs turns products into sums, making estimation numerically stable and turning maximum likelihood into a tractable optimization problem.
- Negative log-likelihoodNegative log-likelihood is the loss obtained by negating the log-likelihood, so maximizing probability becomes minimizing a positive objective. It is the standard training loss for probabilistic classifiers, language models, and many generative models.
- Maximum likelihood estimate (MLE)The maximum likelihood estimate selects the parameter values that make the observed data most probable under the model. Many standard ML objectives, including cross-entropy for classification and next-token prediction for LLMs, are just MLE written as minimization of negative log-likelihood.
- Laplace smoothingLaplace smoothing adds a small constant, often 1, to every discrete count before normalizing probabilities. It prevents zero-probability events in models such as naive Bayes and n-gram LMs, though it can over-smooth when the vocabulary is large.
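A toy add-one-smoothed bigram model that ties together the n-gram, count-based language model, and Laplace smoothing entries above; the corpus is a single made-up sentence:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
V = len(set(corpus))                               # vocabulary size

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_laplace(w, prev):
    """Add-one smoothed bigram probability P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("cat", "the"))    # seen bigram: relatively high probability
print(p_laplace("rug", "cat"))    # unseen bigram: small but nonzero thanks to smoothing
```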
- Conditional probabilityConditional probability is the probability of an event after restricting attention to cases where another event is known to occur, written \( P(A \mid B) = P(A,B)/P(B) \). It is the basic object behind Bayes' rule, autoregressive models, and all context-dependent prediction.
- Discrete probability distributionA discrete probability distribution assigns nonnegative probabilities to a countable set of outcomes that sum to 1. Softmax outputs in classification and next-token prediction are discrete distributions over labels or vocabulary items.
- Zipf's lawZipf's law says a word's frequency is roughly inversely proportional to its rank in the frequency table. This heavy-tailed structure explains why a few tokens dominate corpora, why vocabularies keep growing with more data, and why tokenization and smoothing are central in NLP.
- Cross-entropyCross-entropy measures the average coding cost of samples from a true distribution \( p \) when encoded using a model distribution \( q \). In ML it is the standard loss for classification and language modeling, and minimizing it is equivalent to maximum likelihood up to an entropy constant.
- Binary cross-entropyBinary cross-entropy is the cross-entropy loss for a Bernoulli target, typically \( -[y\log \hat p + (1-y)\log(1-\hat p)] \). It is the standard loss for binary classification and for multi-label problems where each label is predicted independently.
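The binary cross-entropy formula as a short NumPy function, with clipping to avoid log(0); the targets and predicted probabilities are made up:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean BCE: -[y log p + (1-y) log(1-p)], with probabilities clipped away from 0 and 1."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.99])
print(binary_cross_entropy(y, p))
```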
- LogitA logit is the raw score before sigmoid or softmax normalization. In binary settings, the logit is also the log-odds \( \log\frac{p}{1-p} \), which is why linear models such as logistic regression operate naturally in logit space.
- ClassificationClassification is a supervised learning task in which the target is a discrete label rather than a continuous value. The model learns decision boundaries that separate classes, often outputting calibrated class probabilities as well as the predicted label.
- Binary classificationBinary classification is classification with exactly two classes, usually framed as predicting the probability of a positive class. It is commonly trained with logistic regression or a sigmoid output and binary cross-entropy loss.
- Multiclass classificationMulticlass classification assigns each input to exactly one of \( K>2 \) mutually exclusive classes. Models usually produce a softmax distribution over classes and train with cross-entropy against a one-hot or label-smoothed target.
- RegressionRegression is a supervised learning task where the target is continuous rather than categorical. The model predicts a numeric value, and common losses such as mean squared error correspond to assumptions about the noise model, especially Gaussian noise.
- Linear RegressionLinear regression models a target as an affine function of the inputs, typically \( y \approx w^\top x + b \), and fits the parameters by minimizing squared residuals. It is the canonical baseline for regression because it is interpretable and often has a closed-form OLS solution.
- Logistic RegressionLogistic regression is a linear classifier that models the log-odds of a class as \( w^\top x + b \) and maps that score through a sigmoid to get a probability. Despite its name, it is a classification model, not a regression model.
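A minimal sketch of logistic regression trained by batch gradient descent on synthetic, made-up data; the gradient is that of mean binary cross-entropy:

```python
import numpy as np

# Synthetic binary data (made up): the class depends on a linear score of x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    z = X @ w + b                       # log-odds
    p = 1 / (1 + np.exp(-z))            # sigmoid -> probability of the positive class
    grad_w = X.T @ (p - y) / len(y)     # gradient of mean binary cross-entropy
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)                             # direction of w roughly aligned with [1, -2]
```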
- AccuracyAccuracy is the fraction of predictions that are correct, \( (\text{TP}+\text{TN})/N \). It is intuitive and useful when classes are balanced, but it can be badly misleading on imbalanced datasets where always predicting the majority class already yields high accuracy.
- PrecisionPrecision is the fraction of predicted positives that are truly positive, \( \text{TP}/(\text{TP}+\text{FP}) \). It matters most when false positives are costly, such as spam filters, safety classifiers, or medical screening follow-ups.
- RecallRecall is the fraction of actual positives that the model successfully retrieves, \( \text{TP}/(\text{TP}+\text{FN}) \). It matters most when missing positives is costly, such as fraud detection, disease screening, or retrieval systems where relevant items should not be overlooked.
- F1 ScoreF1 score is the harmonic mean of precision and recall, \( 2PR/(P+R) \). It is high only when both precision and recall are high, making it useful for imbalanced classification where accuracy hides the trade-off between false positives and false negatives.
- Longest Common SubsequenceThe longest common subsequence is the longest sequence of symbols that appears in two sequences in the same order, not necessarily contiguously. It underlies edit-distance-style dynamic programming and metrics such as ROUGE-L because it captures shared sequence structure beyond exact n-gram matches.
- Edit DistanceEdit distance is the minimum number of insertions, deletions, and substitutions needed to transform one sequence into another. The most common version, Levenshtein distance, is a dynamic-programming measure of string similarity used in spelling correction, alignment, and evaluation.
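A compact dynamic-programming sketch of Levenshtein distance using a rolling row:

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))               # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute, or match for free
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```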
- PerceptronThe perceptron is a linear threshold classifier that predicts a class from the sign of \( w^\top x + b \) and updates its weights only on mistakes. It is historically important because it introduced gradient-like learning for linear separators, but it only converges when the data are linearly separable.
- Decision TreeA decision tree predicts by recursively splitting the feature space with if-then tests until a leaf assigns a class or value. Trees are easy to interpret and capture nonlinearity, but a single deep tree has high variance and overfits without pruning or ensembling.
- Random ForestA random forest is an ensemble of decision trees trained on bootstrap samples with random feature subsetting at each split. Averaging many decorrelated trees greatly reduces variance, which is why random forests are strong tabular baselines with little tuning.
- Support Vector Machine (SVM)A support vector machine finds the decision boundary that maximizes the margin between classes, depending only on the support vectors nearest the boundary. With kernels, SVMs can model nonlinear separators while retaining a convex optimization objective.
- Principal Component Analysis (PCA)Principal component analysis finds orthogonal directions of maximal variance in the data and projects onto the top few of them. It is a linear dimensionality-reduction method that compresses data, denoises features, and reveals dominant global structure through eigenvectors of the covariance matrix.
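A small sketch of PCA via the eigendecomposition of the sample covariance matrix; the data are synthetic and given deliberate low-dimensional structure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 5))   # made-up data with low-rank structure
X = X - X.mean(axis=0)                                    # PCA assumes centered data

cov = X.T @ X / (len(X) - 1)                              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                    # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]                        # top-2 directions of maximal variance

X_reduced = X @ components                                # project onto the principal components
print(X_reduced.shape)                                    # (200, 2)
print(eigvals[order] / eigvals.sum())                     # fraction of variance per component
```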
- Dimensionality ReductionDimensionality reduction maps data into fewer dimensions while preserving as much important structure as possible, such as variance, distances, or neighborhood relations. It is used for compression, visualization, denoising, and making downstream learning easier in high-dimensional spaces.
- Self-AttentionSelf-attention lets each token compute a weighted combination of representations from other tokens in the same sequence, with weights determined by query-key similarity. It is the mechanism that gives Transformers flexible, content-dependent context mixing without recurrence.
- Attention ScoreAn attention score is the compatibility value computed between a query and a key before normalization, often by dot product or a learned variant. Higher scores mean the corresponding token or memory slot should receive more weight after the softmax.
- What is a scaled attention score?A scaled attention score is a query-key dot product divided by \( \sqrt{d_k} \) before softmax. The scaling keeps the variance of the logits from growing with key dimension, which helps prevent softmax saturation and keeps gradients well behaved.
- Attention WeightsAttention weights are the normalized coefficients, usually produced by a softmax over attention scores, that determine how much each value vector contributes to the output. They form a distribution over positions or memory entries for each query.
- Causal MaskA causal mask blocks attention to future positions by masking entries above the sequence diagonal. It enforces left-to-right autoregressive prediction, ensuring that token \( t \) can depend only on tokens \( \le t \).
- Attention HeadAn attention head is one parallel query-key-value attention computation inside multi-head attention. Different heads can specialize to different patterns, such as local syntax, long-range dependencies, or induction-like copying behavior.
- Query, Key, Value (QKV)Query, key, and value are the three learned projections used by attention: the query asks what to look for, the key says what each position offers, and the value is the content returned if that position is attended to. Attention weights come from query-key similarity, but outputs are weighted sums of values.
- Projection MatrixA projection matrix is a learned linear map that transforms vectors into another representation space. In Transformers, separate projection matrices create Q, K, and V from hidden states, and another projection maps concatenated head outputs back to the model dimension.
- Position-wise MLPA position-wise MLP is the feed-forward sublayer in a Transformer block, applied independently to each token after attention. It adds nonlinearity and channel mixing per token, complementing attention, which mixes information across positions.
- Residual Connection (Skip Connection)A residual connection adds a layer's input back to its output, so the layer learns a correction rather than an entirely new representation. This stabilizes optimization, improves gradient flow, and is one reason very deep networks and Transformers train reliably.
- Context WindowThe context window is the maximum number of tokens a model can process in one forward pass. It defines the model's accessible working memory at inference time, and longer windows increase both usefulness on long documents and computational cost.
- AutoregressionAutoregression is the factorization of a sequence distribution into a product of conditional next-step distributions. In language generation it means producing one token at a time, each conditioned on all previously generated tokens.
- PromptA prompt is the text or structured input given to a language model to condition its behavior and output. It can provide instructions, examples, retrieved context, or tool schemas, and in practice it acts as the model's temporary task specification.
- PretrainingPretraining is the large-scale first stage of training where a model learns general-purpose representations from unlabeled or self-supervised data. For LLMs this usually means next-token prediction over massive corpora, producing a base model that later fine-tuning can adapt.
- FinetuningFinetuning continues training a pretrained model on a smaller task-specific or domain-specific dataset. It adapts existing representations rather than learning from scratch, which is why it usually needs far less data and compute than pretraining.
- Greedy DecodingGreedy decoding always selects the highest-probability next token at each step. It is simple and deterministic, but it often gets trapped in bland or repetitive continuations because it never explores slightly less probable alternatives that might lead to better sequences.
- Model CompressionModel compression reduces a model’s memory, latency, or energy cost while trying to preserve performance. Common compression methods include distillation, pruning, quantization, low-rank factorization, and architecture redesign.
- QuantizationQuantization stores or computes with lower-precision numbers, such as INT8 or 4-bit values, instead of full-precision floats. It reduces memory bandwidth and can speed inference, but accuracy depends on how well the lower-precision representation preserves weights and activations.
- Bradley-Terry ModelThe Bradley-Terry model turns pairwise comparisons into latent scores by assuming the probability that item A beats item B depends on their score difference. It is widely used for preference modeling, ranking, and reward-model training from pairwise judgments.
- Pairwise ComparisonA pairwise comparison asks which of two items is better instead of assigning each item an absolute score. These judgments are often easier and more consistent for humans, which is why they are common in ranking, Elo-style systems, and alignment datasets.
- Elo RatingElo rating estimates skill from pairwise wins and losses by updating each participant’s score based on expected versus actual outcomes. It was designed for chess, but the same logic is used to aggregate model preferences and benchmark head-to-head evaluations.
- Bootstrap ResamplingBootstrap resampling estimates uncertainty by repeatedly sampling with replacement from an observed dataset and recomputing a statistic on each resample. It is useful when analytic uncertainty formulas are hard to derive, though it assumes the sample is reasonably representative.
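A short percentile-bootstrap sketch for the mean of a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)           # synthetic sample, for illustration

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean() # resample with replacement, recompute the statistic
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])           # percentile bootstrap 95% interval
print(f"mean = {data.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```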
- Confidence IntervalA confidence interval is a range produced by a procedure that would contain the true parameter a fixed fraction of the time over repeated samples, such as 95%. It quantifies estimation uncertainty, but it is not the probability that the parameter lies in this particular realized interval.
- Positional EncodingPositional encoding injects token order information into architectures like Transformers whose attention is otherwise permutation-invariant. It can be absolute or relative, and the choice strongly affects extrapolation, long-context behavior, and inductive bias.
- Absolute Position EncodingAbsolute position encoding assigns each sequence position its own encoding or embedding and combines it with token representations. It works well inside the trained context range, but it often extrapolates poorly because positions are treated as fixed IDs rather than relative distances.
- Token IDA token ID is the integer index assigned to a token after tokenization. Models do not operate on raw text directly; they look up embeddings from token IDs and later map output logits back to IDs during decoding.
- Vocabulary SizeVocabulary size is the number of distinct tokens a tokenizer can emit. A larger vocabulary shortens sequences but increases embedding and softmax size, while a smaller vocabulary produces longer sequences and more token fragmentation.
- Subword TokenizationSubword tokenization splits text into frequent pieces smaller than words but larger than individual characters. It handles rare words and open vocabularies well by composing unfamiliar words from known subword units.
- Special TokensSpecial tokens are reserved tokens with structural or control meaning, such as BOS, EOS, PAD, SEP, or mask tokens, rather than ordinary text content. They shape formatting, training objectives, and sometimes model behavior.
- Padding TokenA padding token is a dummy token added so sequences in a batch have equal length. It should be ignored by the loss and usually masked from attention so it does not behave like real context.
- BOS Token (Beginning of Sequence)A BOS token marks the beginning of a sequence and gives the model a consistent start symbol for conditioning generation or encoding. It can help define sequence boundaries and sometimes carries special training semantics.
- EOS Token (End of Sequence)An EOS token marks the end of a sequence and tells the model where generation should stop. During training it teaches sequence termination, and during inference it is one of the main stopping conditions.
- Sequence-to-Sequence (Seq2Seq)Sequence-to-sequence learning maps one sequence to another, often with different lengths, such as translation or summarization. Modern seq2seq models are usually encoder-decoder Transformers, though earlier versions used recurrent networks with attention.
- Distributed RepresentationA distributed representation stores a concept as a pattern across many features or neurons rather than in a single symbolic slot. This supports similarity, composition, and generalization because related concepts can occupy nearby regions of representation space.
- Representation LearningRepresentation learning is the process of learning useful features automatically from data rather than hand-engineering them. Good representations preserve the structure that downstream tasks need, such as semantic similarity, invariances, or factors of variation.
- Latent SpaceA latent space is the internal feature space in which a model represents inputs after transformation, often in a form that is more compact or task-relevant than raw data. Distances or directions in latent space can encode meaningful variation, but only relative to the model and objective that learned it.
- Embedding SpaceAn embedding space is the vector space produced by an embedding model, where tokens, sentences, images, or other objects are mapped to dense numerical representations. Similarity in that space is used for retrieval, clustering, and transfer, though the geometry depends on the training objective.
- BM25BM25 is a sparse retrieval scoring function that ranks documents using term matches weighted by inverse document frequency and document-length normalization. It remains strong for exact lexical search and is often combined with dense retrieval in hybrid systems.
- Sparse RetrievalSparse retrieval represents queries and documents with sparse term-based features such as inverted indexes, TF-IDF, or BM25. It excels at exact keywords and rare identifiers, but is weaker than dense retrieval on paraphrases and semantic matching.
- Adam OptimizerAdam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter’s update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
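A minimal sketch of the Adam update with bias correction, applied to a toy quadratic objective; the hyperparameters are the commonly cited defaults, not a prescription:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)            # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 as a toy objective; the gradient is 2w.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)   # near [0, 0]; Adam oscillates within roughly lr of the optimum
```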
- AdaGradAdaGrad adapts learning rates by dividing each parameter’s update by the square root of the accumulated historical squared gradients. It works especially well for sparse features, but its learning rates can decay too aggressively over long training runs.
- MomentumMomentum accumulates a running velocity of past gradients so updates keep moving in consistent directions and damp noisy zig-zags. It speeds optimization in ravines and is commonly paired with SGD or Nesterov variants.
- Learning Rate ScheduleA learning-rate schedule changes the learning rate over training instead of keeping it constant. Schedules matter because they balance fast early progress with stable late optimization and often determine final performance as much as the base optimizer.
- WarmupWarmup starts training with a small learning rate and gradually increases it during the first steps. It reduces early instability, especially in Transformers where large updates before optimizer statistics settle can cause divergence.
- Weight InitializationWeight initialization chooses starting parameter values before training begins. Good initialization keeps activations and gradients in useful ranges so learning can start without vanishing, exploding, or breaking symmetry.
- Tokenization PipelineA tokenization pipeline is the full process that turns raw text into model-ready inputs, including normalization, pre-tokenization, subword splitting, token-to-ID mapping, truncation, padding, and special-token insertion. Choices here directly affect sequence length, vocabulary coverage, and downstream behavior.
- GPU AccelerationGPU acceleration uses highly parallel graphics processors to speed the matrix and tensor operations that dominate modern ML workloads. It matters because deep learning is mostly throughput-bound linear algebra, which GPUs execute far more efficiently than general-purpose CPUs.
- CUDACUDA is NVIDIA’s parallel-computing platform and programming model for running general-purpose kernels on GPUs. In machine learning it is the software layer that makes GPU-accelerated training and inference practical, exposing massive parallelism, specialized libraries, and direct control over device memory.
- Benchmark (ML Evaluation)A benchmark in ML evaluation is a standardized task, dataset, metric, and protocol used to compare systems reproducibly. Benchmarks are useful because they make progress measurable, but they can be gamed, saturated, or misaligned with real-world performance.
- Human EvaluationHuman evaluation uses people to judge outputs on qualities such as helpfulness, factuality, coherence, or safety that automated metrics often miss. It is usually the most trustworthy evaluation for subjective tasks, but it is expensive, slow, and sensitive to rubric design and annotator variance.
- Automated EvaluationAutomated evaluation scores model outputs with metrics or model-based judges instead of human raters. It is fast, scalable, and reproducible, but its usefulness depends on how well the metric correlates with the human judgment that actually matters.
- What is catastrophic forgetting?Catastrophic forgetting is the sharp loss of performance on previously learned tasks after a model is trained on new ones. It happens because gradient updates that help the new task can overwrite internal representations that were supporting the old task.
- Pretraining CorpusA pretraining corpus is the large unlabeled dataset used to train a model’s base capabilities through self-supervised objectives such as next-token prediction. Its size, quality, duplication rate, domain mix, and filtering choices strongly shape what the model knows and how it behaves.
- Reward SignalA reward signal is the scalar feedback an RL agent receives about the desirability of its behavior. Because the agent optimizes whatever reward it is given, the design of the reward signal determines whether learning produces the intended behavior or merely exploits a proxy.
- Policy (Reinforcement Learning)In reinforcement learning, a policy is the rule that maps states or observations to actions, often as a probability distribution. Learning a policy means directly improving behavior, and in language-model RL the policy is the model’s distribution over tokens or completions conditioned on context.
- AI SafetyAI safety is the broader field concerned with preventing harmful or catastrophic outcomes from advanced AI systems. It includes alignment, robustness, misuse prevention, monitoring, control, and governance, so it is wider than just making a chatbot refuse bad requests.
- Alignment (AI)Alignment in AI is the problem of making an AI system’s objectives and behavior match human intentions and values rather than a flawed proxy. The hard part is not only teaching what humans say they want, but ensuring the system pursues that goal robustly in new situations.
- InterpretabilityInterpretability is the study of making model behavior understandable to humans, whether by explaining predictions, revealing learned features, or analyzing internal structure. It matters because debugging, trust, scientific understanding, and safety all depend on seeing more than just inputs and outputs.
- Foundation ModelA foundation model is a large general-purpose model pretrained on broad data and then adapted to many downstream uses through prompting, fine-tuning, or tool use. Its defining property is transfer: one base model can support many tasks rather than being built for just one.
- Scaling LawsScaling laws are empirical relationships showing how loss or capability changes with model size, data, and compute, often following approximate power laws. They matter because they let researchers forecast returns to scale and choose more compute-efficient training regimes.
- Transfer LearningTransfer learning reuses knowledge learned on one task or dataset to improve performance on another. It is effective because useful features learned in a high-resource setting often remain useful in a lower-resource target domain.
- Data AugmentationData augmentation expands a training set with label-preserving transformations such as crops, paraphrases, or noise injection. It improves generalization by teaching the model which variations should not change the answer.
- Softmax HeadA softmax head is the output projection plus softmax normalization that converts hidden representations into a probability distribution over classes or vocabulary items. In language models it is the layer that turns the final hidden state into next-token probabilities.
- Beam SearchBeam search is a decoding algorithm that keeps the top-scoring partial sequences at each step instead of only the single best one. It approximates high-probability generation better than greedy decoding, but it can still miss the global optimum and often reduces diversity.
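A toy beam-search sketch over a stand-in "model" that returns a fixed next-token distribution; a real decoder would condition these log-probabilities on the sequence generated so far:

```python
import math

def beam_search(logprobs_fn, beam_width=2, max_len=3):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [((), 0.0)]                                   # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in logprobs_fn(seq).items():
                candidates.append((seq + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy stand-in for a language model: a fixed next-token distribution (made up).
def toy_model(seq):
    return {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

print(beam_search(toy_model))   # the two highest-probability length-3 sequences
```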
- Encoder (Transformer)A Transformer encoder is a stack of self-attention and feed-forward blocks that builds contextual representations of an input sequence. Because encoder self-attention is usually bidirectional, it is well suited for understanding tasks such as classification, retrieval, and sequence labeling.
- Decoder (Transformer)A Transformer decoder is the autoregressive half of the architecture that predicts tokens using causal self-attention and, in encoder-decoder models, optional cross-attention to an encoder output. Its defining constraint is that each position can attend only to earlier positions when generating.
- Neural Language ModelA neural language model predicts text with learned distributed representations and a neural network rather than count tables. Its main advantage over classical n-gram models is that it can generalize to unseen contexts by sharing statistical strength across similar words and patterns.
- Semantic SpaceA semantic space is an embedding space in which geometric relations reflect meaning, similarity, or functional role. Nearby points tend to correspond to semantically related items, which is why vector search and representation learning work at all.
- Softmax NormalizationSoftmax normalization converts a vector of logits into a probability distribution by exponentiating each score and dividing by the total. It preserves rank order while making outputs comparable, which is why it is the standard final normalization for multiclass prediction.
- Over-ParameterizationOver-parameterization means a model has far more parameters than the minimal number apparently needed to fit the data. Counterintuitively, this often helps optimization and can still generalize well because training dynamics and implicit regularization matter as much as raw parameter count.
- Conversational AIConversational AI is a class of systems designed for multi-turn interaction, where the model must respond helpfully while tracking context, intent, and dialogue state. The hard part is not generating one good answer, but remaining coherent and useful across an extended interaction.
- Bias MitigationBias mitigation is the set of methods used to reduce unfair or systematically skewed behavior in models and datasets. It can act before training, during optimization, or after deployment, but every intervention trades off fairness goals, accuracy, and operational complexity.
- Transparency (AI Systems)Transparency in AI systems means making system behavior, limitations, provenance, and decision pathways inspectable to users, developers, or regulators. It is broader than interpretability because it includes documentation, reporting, and operational visibility, not just internal model analysis.
- Log-Sum-Exp TrickThe log-sum-exp trick computes expressions like \( \log \sum_i \exp(x_i) \) stably by subtracting the maximum logit before exponentiation. It prevents overflow and underflow, so it is a standard numerical tool in softmax, cross-entropy, and probabilistic inference.
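A two-line sketch of the trick, showing a case where the naive computation would overflow:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                 # ≈ 1002.41, no overflow
# np.log(np.sum(np.exp(x))) would overflow to inf in float64
```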
- Exponential Family of DistributionsThe exponential family is the class of distributions that can be written in the form \( \exp(\eta^\top T(x) - A(\eta) + c(x)) \). This shared form gives them sufficient statistics, convenient conjugate priors, and clean maximum-likelihood geometry, which is why they dominate classical statistical modeling.
- KL DivergenceKL divergence measures how one probability distribution differs from a reference distribution through an expected log-ratio. It is nonnegative and asymmetric, so it is best understood not as a distance but as the penalty for modeling samples from one distribution with another.
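A sketch for discrete distributions, using the convention that terms with zero probability under p contribute nothing; the asymmetry is visible by evaluating both orderings.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # 0 * log(0/q) is taken to be 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]
q = [0.6, 0.2, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))  # nonnegative, and the two orderings differ
```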
- Jensen's InequalityJensen’s inequality says that for a convex function f, applying f after taking an expectation gives a value no larger than taking the expectation after applying f. This one fact underlies many core results in ML, including the nonnegativity of KL divergence and the derivation of variational lower bounds.
- Bayes' TheoremBayes’ theorem updates beliefs by combining a prior with the likelihood of observed evidence to produce a posterior. In compact form, posterior is proportional to likelihood times prior, which is why Bayesian inference is fundamentally a rule for disciplined belief revision.
- Eigenvalues and EigenvectorsFor a matrix A, an eigenvector is a nonzero direction that A only rescales, and the scaling factor is its eigenvalue. Eigenpairs matter because they reveal invariant directions, control stability, and make problems like PCA and spectral clustering possible.
- Jacobian and HessianThe Jacobian collects first-order partial derivatives of a vector-valued function, while the Hessian collects second-order partial derivatives of a scalar function. Together they describe local sensitivity and curvature, which is why they are central to optimization and dynamical analysis.
- Multivariate Gaussian DistributionThe multivariate Gaussian is the vector-valued generalization of the normal distribution, parameterized by a mean vector and covariance matrix. It is foundational because linear transformations, marginals, and conditionals all stay Gaussian, making analysis and inference unusually tractable.
- Naive Bayes ClassifierNaive Bayes is a probabilistic classifier that applies Bayes’ theorem under the simplifying assumption that features are conditionally independent given the class. That assumption is usually false, but the model is still fast, data-efficient, and surprisingly effective for sparse text problems.
- k-Nearest Neighbors (k-NN)k-nearest neighbors predicts by finding the k closest labeled training points to a query and then voting or averaging their labels. It has almost no training phase, but its predictions depend heavily on the distance metric and it degrades in high-dimensional spaces.
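A minimal sketch of the classification case with Euclidean distance; the distance metric is a modeling choice, not a fixed part of the method.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                      # indices of the k closest
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                   # majority label

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))          # -> 1
```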
- Universal Approximation TheoremThe universal approximation theorem says that a sufficiently wide neural network with a suitable nonlinearity can approximate any continuous function on a compact domain arbitrarily well. It is an existence result, not a guarantee that training will find that approximation efficiently.
- Chain Rule of ProbabilityThe factorisation \( p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid x_{<i}) \) that decomposes any joint distribution into a product of conditionals. It is the mathematical bedrock of autoregressive language models, belief networks, and most tractable density estimation.
- Central Limit TheoremGiven i.i.d. samples \( X_1, \dots, X_n \) with finite mean \( \mu \) and variance \( \sigma^2 \), the standardised sample mean \( \sqrt{n}(\bar X_n - \mu)/\sigma \) converges in distribution to \( \mathcal{N}(0,1) \). The CLT underlies confidence intervals, stochastic-gradient noise analysis, and many initialisation arguments.
- Variance, Covariance and CorrelationVariance measures the spread of a single random variable; covariance measures joint variation of two; correlation normalises covariance to \( [-1, 1] \) and is scale-free. Together they form the second-order statistics that drive PCA, linear regression, Kalman filters, and most initialisation schemes.
- Taylor Series ExpansionApproximates a smooth function near a point \( a \) by a polynomial whose coefficients are the function's derivatives: \( f(x) \approx \sum_{k=0}^{K} \frac{f^{(k)}(a)}{k!}(x-a)^k \). Provides the theoretical scaffolding for gradient descent (1st order), Newton's method (2nd order), Laplace approximations, and loss-landscape analysis.
- Entropy vs Cross-Entropy vs KL (Unified View)A single identity \( H(p,q) = H(p) + D_{\text{KL}}(p \| q) \) ties the three together: entropy is the optimal code length under \( p \); cross-entropy is the code length using \( q \); KL is the extra cost of the mismatch. This view clarifies why minimising classification cross-entropy is equivalent to MLE and to minimising KL.
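A quick numerical check of the identity on a toy pair of distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy       = -np.sum(p * np.log(p))        # H(p): optimal code length under p
cross_entropy = -np.sum(p * np.log(q))        # H(p, q): code length when coding with q
kl            =  np.sum(p * np.log(p / q))    # D_KL(p || q): extra cost of the mismatch

# H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, entropy + kl)
```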
- Matrix Rank and the Rank–Nullity TheoremThe rank of \( A \in \mathbb{R}^{m \times n} \) is the dimension of its column space, equal to the dimension of its row space. The rank–nullity theorem states \( \text{rank}(A) + \text{nullity}(A) = n \), linking how much a linear map preserves to how much it collapses — the foundation for invertibility conditions, low-rank adaptation, and OLS identifiability.
- Matrix Norms: Frobenius, Spectral, NuclearThe three dominant matrix norms measure different notions of size: Frobenius \( \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2} \) is the entrywise \( \ell_2 \); spectral (operator) \( \|A\|_2 = \sigma_{\max}(A) \) is the largest gain on unit vectors; nuclear \( \|A\|_* = \sum_i \sigma_i(A) \) is the convex surrogate for rank. Each appears in a distinct ML context: regularisation, Lipschitz bounds, and low-rank recovery respectively.
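A short sketch computing all three from the singular values of a random matrix; the final assertion checks that the Frobenius norm equals the \( \ell_2 \) norm of the singular values.

```python
import numpy as np

A = np.random.randn(4, 3)
sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A

frobenius = np.linalg.norm(A, 'fro')         # sqrt of the sum of squared entries
spectral  = sigma.max()                      # largest gain on unit-norm inputs
nuclear   = sigma.sum()                      # sum of singular values, convex surrogate for rank

assert np.isclose(frobenius, np.linalg.norm(sigma))
```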
- Positive (Semi-)Definite MatricesA symmetric matrix \( A \in \mathbb{R}^{n \times n} \) is positive semi-definite (PSD) if \( x^\top A x \ge 0 \) for all \( x \), and positive definite (PD) if strict for \( x \ne 0 \). Equivalent characterisations include non-negative eigenvalues and a Cholesky factorisation \( A = L L^\top \). PSD structure underlies covariance matrices, Gram matrices, convex Hessians, and every kernel method.
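A small sketch using the Cholesky factorisation as a definiteness test: the factorisation succeeds exactly when the symmetric matrix is positive definite, and Gram matrices \( B B^\top \) are always PSD.

```python
import numpy as np

def is_positive_definite(A):
    """Cholesky succeeds iff the symmetric matrix A is positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

B = np.random.randn(3, 3)
gram = B @ B.T + 1e-6 * np.eye(3)   # Gram matrix is PSD; the small jitter makes it strictly PD
print(is_positive_definite(gram))        # True
print(is_positive_definite(-np.eye(3)))  # False: x^T(-I)x < 0 for any nonzero x
```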
- Random Variables, Expectation, and Variance (Axiomatic)A random variable \( X \) is a measurable map from a probability space \( (\Omega, \mathcal{F}, P) \) to \( \mathbb{R} \). Its expectation \( \mathbb{E}[X] = \int X\,dP \) and variance \( \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \) are the first two moments. Beyond textbook formulas, this axiomatic view explains why expectations linearise sums, exist iff \( \mathbb{E}|X| < \infty \), and commute with limits under uniform integrability.
- Bernoulli, Binomial, Categorical, and Multinomial DistributionsThe four atomic discrete distributions: Bernoulli \( (p) \) for a single binary trial, Binomial \( (n, p) \) for a sum of \( n \) such trials, Categorical \( (\pi) \) for a single \( K \)-class outcome (softmax target), and Multinomial \( (n, \pi) \) for counts across \( n \) such trials. They form the likelihood backbone of logistic regression, cross-entropy training, and count models.
- Law of Total ProbabilityThe law of total probability computes an event probability by summing over mutually exclusive, exhaustive cases. In machine learning it is the basic marginalization identity behind latent-variable models, mixture models, and many Bayesian calculations.
- Conditional IndependenceConditional independence means two variables become unrelated once a third variable is known. It is the simplifying assumption that makes graphical models tractable and explains why conditioning can either remove dependence or, in collider structures, create it.
- Empirical Risk Minimization (ERM)Empirical risk minimization chooses the model with the smallest average training loss. It is the default principle behind most supervised learning, but it must be paired with capacity control or held-out evaluation because low training loss alone does not guarantee generalization.
- Train/Validation/Test SplitA train/validation/test split separates fitting, model selection, and final evaluation into different datasets. The test set is kept untouched until the end so it remains a credible estimate of out-of-sample performance.
- k-Fold Cross-Validationk-fold cross-validation rotates a held-out fold through the dataset so every example is used for validation once and training the other times. It uses limited data efficiently for model selection, but it costs multiple training runs and must keep preprocessing inside each fold.
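A minimal sketch of the fold rotation, with the preprocessing statistics deliberately fitted inside each fold to avoid leaking validation information into training.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs so every example is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

X, y = np.random.randn(20, 3), np.random.randn(20)
for train_idx, val_idx in k_fold_indices(len(X), k=5):
    mu, sd = X[train_idx].mean(0), X[train_idx].std(0)       # preprocessing fit inside the fold
    X_train, X_val = (X[train_idx] - mu) / sd, (X[val_idx] - mu) / sd
    # ...fit the model on (X_train, y[train_idx]) and score it on (X_val, y[val_idx])
```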
- Confusion MatrixA confusion matrix counts predicted labels against true labels. In binary classification it yields the four basic counts—true positives, false positives, true negatives, and false negatives—from which most common thresholded metrics are derived.
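A sketch of the four binary counts and two of the metrics derived from them:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of true positives, how many were found
```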
- Feature Scaling and StandardizationFeature scaling rescales input dimensions to comparable magnitudes, while standardization specifically subtracts the training mean and divides by the training standard deviation. It matters because optimization, distances, and margins can otherwise be dominated by whichever feature uses the largest units.
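A short sketch of standardization done correctly: the mean and standard deviation come from the training split only, then the same transform is applied to held-out data.

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_test  = np.array([[2.5, 250.0]])

# Fit the scaling statistics on the training set only...
mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...then apply the same transform to both splits, so no test information leaks in.
X_train_std = (X_train - mu) / sigma
X_test_std  = (X_test - mu) / sigma
```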
- Markov Decision Process (MDP)A Markov decision process formalizes sequential decision-making with states, actions, transitions, rewards, and a discount factor. Its key assumption is that the next-state and reward distribution depends only on the current state and action, which makes Bellman-style planning possible.
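The Bellman-style planning mentioned above rests on one recursion; in standard notation (state-action reward \( R(s, a) \), transition kernel \( P \), discount \( \gamma \)), the optimal state value satisfies:

```latex
% Bellman optimality equation for an MDP (S, A, P, R, \gamma):
% the Markov assumption lets the value of a state be a one-step lookahead.
V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a) + \gamma\, V^*(s') \big]
```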
- AlexNetAlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe—ImageNet-scale data, GPU training, ReLU, dropout, and augmentation—not from a single isolated trick.
- Brier ScoreThe Brier score measures the mean squared error of probabilistic predictions, so it rewards both correctness and calibration. Lower is better, and unlike accuracy it penalizes a confidently wrong 0.99 prediction much more than a cautious 0.6 prediction.
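A tiny sketch for the binary case, illustrating how much harder a confidently wrong prediction is penalized than a cautious one:

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and binary outcomes (lower is better)."""
    y_true, p_pred = np.asarray(y_true, float), np.asarray(p_pred, float)
    return float(np.mean((p_pred - y_true) ** 2))

# Same miss (true label 0), very different penalties:
print(brier_score([0], [0.99]))  # 0.9801  -> confidently wrong
print(brier_score([0], [0.60]))  # 0.36    -> cautiously wrong
```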
- Anomaly DetectionAnomaly detection identifies observations that look unlikely under the pattern of normal data. The main families are density-based methods, reconstruction-based methods, and one-class classification methods, and the right choice depends on whether you have labels, strong feature engineering, or only normal examples.
- Target Leakage vs. Data LeakageData leakage is any contamination that lets training or validation use information that would not be available at prediction time. Target leakage is the specific case where features encode the label or a post-outcome proxy for it, so every target leakage problem is data leakage, but not every data leakage problem is target leakage.
- Backpropagation — History (Werbos → Rumelhart/Hinton/Williams)The history of backpropagation is the story of an idea known in pieces before it became a practical neural-network training method. Werbos articulated reverse-mode differentiation for network training in the 1970s, and Rumelhart, Hinton, and Williams turned it into the landmark 1986 demonstration that made multilayer neural networks trainable in practice.