Tag: intermediate
268 topic(s)
- Layer Dropping and Progressive Pruning (TrimLLM): Layer dropping and progressive pruning reduce inference cost by cutting transformer depth rather than shrinking every matrix. TrimLLM does this progressively for domain-specialized LLMs, exploiting the empirical fact that not all layers are equally important in a target domain and aiming to retain in-domain accuracy while reducing latency.
- Test-Time Compute Scaling: Test-time compute scaling improves a model by spending extra computation at inference time, for example through search, verification, reranking, or adaptive refinement, instead of only scaling pretraining. It is most useful on prompts where the base model already has some chance of success, because additional compute can then amplify that success more efficiently than a much larger one-shot model.
- GPT-1 (Generative Pre-Training): GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- ELMo (Embeddings from Language Models): ELMo produces contextualized word embeddings by taking a learned task-specific combination of hidden states from a pretrained bidirectional LSTM language model. Unlike static embeddings such as word2vec or GloVe, it gives the same word different vectors in different sentence contexts.
- Key-Value Memory Networks: Key-Value Memory Networks store each memory slot as a key for retrieval and a separate value for the returned content. This decouples matching from payload and is a direct conceptual precursor to modern query-key-value attention.
- Luong Attention (Global and Local): Luong attention is a sequence-to-sequence attention mechanism that scores decoder states against encoder states using multiplicative forms such as dot or bilinear attention. It distinguishes global attention over all source positions from local attention over a predicted window, helping make neural machine translation more scalable.
- Weight Tying: Weight tying uses the same matrix for token embeddings and the output softmax projection, typically by setting the output weights to the transpose of the input embedding table. This cuts parameters and often improves language modeling by forcing input and output token representations to share geometry.
- Gradient Checkpointing (Activation Recomputation): Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- AdamW Optimizer: AdamW is Adam with decoupled weight decay: parameter shrinkage is applied directly to the weights instead of being mixed into the adaptive gradient update. This preserves the intended regularization effect and is why AdamW became the default optimizer for many Transformer models.
- SwiGLU Activation Function: SwiGLU is a gated feed-forward activation that multiplies one linear projection by a Swish-activated gate from another projection. It usually performs better than standard ReLU-style MLP blocks at similar scale, which is why many modern LLMs use it in their feed-forward layers.
- Pre-Norm vs. Post-Norm Architecture: Pre-Norm vs. Post-Norm is the choice of whether layer normalization is applied before or after each residual sublayer in a Transformer block. Pre-Norm usually trains deeper stacks more stably by preserving gradient flow through the residual path, while Post-Norm was the original design and can be less stable at scale.
- Key-Value (KV) Caching: Key-value caching stores the attention keys and values from earlier tokens during autoregressive decoding so they do not need to be recomputed at every step. It speeds up generation dramatically, but the cache grows with sequence length and turns inference into a memory-management problem.
- Rotary Positional Embedding (RoPE): Rotary Positional Embedding encodes position by rotating query and key vectors with token-index-dependent angles before attention is computed. Because the resulting dot products depend on relative offsets, RoPE gives Transformers a simple and widely used way to represent order.
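The relative-offset property of RoPE can be checked directly in two dimensions: rotate queries and keys by a position-dependent angle and verify that the score depends only on the distance between positions. This is a toy sketch; the vectors and the base angle are illustrative assumptions, and real RoPE applies such rotations pairwise across many dimensions with different frequencies.

```python
import math

# Toy RoPE sketch in 2 dimensions: rotate query/key vectors by an angle
# proportional to their position, then check that the resulting dot
# product depends only on the relative offset between positions.

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by pos * theta radians (toy base angle)."""
    a = pos * theta
    x, y = vec
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (1.0, 0.0), (0.6, 0.8)  # illustrative query and key

# The same offset (2) at different absolute positions gives the same score.
s1 = dot(rotate(q, pos=5), rotate(k, pos=3))
s2 = dot(rotate(q, pos=9), rotate(k, pos=7))
print(round(s1, 6), round(s2, 6))
```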
- Message Passing in Graph Neural Networks (GNNs): Message passing in graph neural networks updates each node by aggregating transformed information from its neighbors and combining it with the node's current representation. After K rounds, a node's state depends on its K-hop neighborhood, which is why message passing is the core operation of most spatial GNNs.
- Triplet Margin Loss: Triplet margin loss trains an embedding space so an anchor is closer to a positive example than to a negative example by at least a fixed margin. It is a standard metric-learning objective because it directly enforces relative similarity rather than predicting class labels.
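The objective above can be written as a one-line hinge. A minimal sketch, with toy embeddings and a margin chosen only for illustration:

```python
import math

# Triplet margin loss sketch: hinge on
#   d(anchor, positive) - d(anchor, negative) + margin.
# Embeddings and margin here are toy values.

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    return max(0.0, euclid(anchor, positive) - euclid(anchor, negative) + margin)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # close to the anchor
negative = [3.0, 0.0]   # far from the anchor

loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: the margin constraint is already satisfied
```

When the negative sits too close (say at [0.5, 0.0]), the hinge activates and the loss becomes positive, which is exactly the gradient signal that pushes the negative away.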
- Hidden Markov Model (HMM): A Hidden Markov Model is a sequence model with an unobserved Markov chain of states and an observed emission distribution from each state. It became a standard model for speech, tagging, and other structured sequence tasks because dynamic programming can efficiently infer likely states and sequence probabilities.
- Markov Chain Monte Carlo (MCMC): Markov Chain Monte Carlo samples from a difficult target distribution by constructing a Markov chain whose stationary distribution matches that target. It is essential in Bayesian inference because it replaces intractable posterior integrals with averages over samples, provided the chain mixes well enough.
- Reparameterization Trick (VAE): The reparameterization trick writes a stochastic latent sample as a differentiable transformation of parameters and noise, typically z equals mu plus sigma times epsilon. This lets gradients flow through sampling and makes variational autoencoder training practical with backpropagation.
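The z = mu + sigma * eps form is short enough to sketch directly. The point is that the randomness lives entirely in eps, so mu and sigma appear only inside a differentiable expression; the parameter values below are illustrative assumptions.

```python
import random

# Reparameterization sketch: z = mu + sigma * eps, with eps ~ N(0, 1)
# drawn independently of the parameters. Gradients w.r.t. mu and sigma
# can flow through this expression because the sampling is in eps alone.

def sample_latent(mu, sigma, rng):
    eps = rng.gauss(0.0, 1.0)   # parameter-free noise
    return mu + sigma * eps     # differentiable in mu and sigma

rng = random.Random(0)
zs = [sample_latent(mu=1.0, sigma=0.1, rng=rng) for _ in range(10_000)]
mean = sum(zs) / len(zs)
print(round(mean, 2))  # close to mu = 1.0
```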
- Autoencoders: Autoencoders are neural networks trained to reconstruct their inputs after passing them through a compressed or otherwise constrained latent representation. They are useful because the bottleneck forces the model to learn structure in the data rather than just memorize an identity map.
- GAN Minimax Objective: The GAN minimax objective sets up a two-player game in which a generator tries to produce samples that fool a discriminator, while the discriminator tries to distinguish real from generated data. At equilibrium the generator matches the data distribution, though the training game is often unstable in practice.
- Q-Learning: Q-learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function by bootstrapping from a Bellman target over the best next action. Because its update does not require following the current policy, it became a foundational method in both tabular RL and DQN-style deep RL.
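The Bellman-target update is easiest to see in tabular form. A minimal sketch on an assumed 3-state chain (move right to reach a terminal reward); the environment, hyperparameters, and epsilon-greedy policy are all toy choices:

```python
import random

# Tabular Q-learning sketch on a 3-state chain MDP.
# Action 0 = left, action 1 = right; reaching state 3 pays reward 1.
N_STATES, ACTIONS = 3, (0, 1)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    if s2 == N_STATES:          # terminal goal state
        return None, 1.0
    return s2, 0.0

Q = [[0.0, 0.0] for _ in range(N_STATES)]
rng = random.Random(0)

for _ in range(500):
    s = 0
    while s is not None:
        # epsilon-greedy behavior policy
        if rng.random() < EPS:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r = step(s, a)
        # Bellman target uses the best next action (off-policy)
        target = r if s2 is None else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])  # values grow toward the goal
```

Because the target maximizes over next actions rather than following the behavior policy, the learned values approach the optimal ones (roughly 0.81, 0.9, 1.0 here) even though the agent sometimes explores randomly.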
- Bootstrap Aggregating (Bagging): Bootstrap Aggregating trains multiple models on bootstrap-resampled versions of the training set and averages their predictions to reduce variance. It helps most with unstable base learners such as decision trees, which is why it underlies random forests.
- Gradient Boosting Machines (GBM): Gradient Boosting Machines build an additive model by fitting each new weak learner to the negative gradient of the current loss. In practice each stage focuses on correcting the remaining errors of the ensemble, which makes boosting powerful but sensitive to overfitting if trees and learning rates are not controlled.
- AdaBoost (Adaptive Boosting): AdaBoost builds an ensemble by repeatedly fitting weak learners to reweighted data so that previously misclassified examples receive more attention. Its final predictor is a weighted vote of the learners, and its power comes from turning many slightly better-than-random classifiers into a strong one.
- Backpropagation through time (BPTT): Backpropagation through time trains a recurrent network by unrolling it across sequence steps and applying backpropagation to the resulting deep computational graph. It exposes how earlier states influence later losses, but long unrolls make optimization and memory use difficult.
- Long short-term memory (LSTM): Long short-term memory is a gated recurrent architecture designed to preserve information over long timescales. Its input, forget, and output gates regulate a cell state with near-linear self-connections, which helps prevent the vanishing-gradient behavior of simple RNNs.
- Gated recurrent unit (GRU): A gated recurrent unit is a recurrent architecture that uses update and reset gates to control how much past information is kept and how much new input is written into the hidden state. It is simpler than an LSTM because it has no separate cell state, yet it often achieves similar sequence-modeling performance.
- Word2Vec: Word2Vec is a family of shallow neural methods that learn word embeddings from local context, most famously via the skip-gram and CBOW objectives. Its importance is that simple predictive training on large text corpora produced useful semantic geometry, including analogy-like linear regularities.
- Skip-gram: Skip-gram trains a model to predict surrounding context words from a center word. It learns embeddings that are especially good for capturing rare-word semantics because each observed word directly becomes a prediction source for many context targets.
- FastText: FastText extends Word2Vec by representing a word as a bag of character n-gram embeddings rather than as a single atomic vector. That lets it model morphology and produce reasonable embeddings for rare or even unseen words.
- Autoregressive language model: An autoregressive language model generates text left-to-right by modeling \( P(w_t \mid w_{<t}) \) for each token. Because it only conditions on past tokens, it can be used directly for open-ended generation as well as scoring sequences.
- Masked language model: A masked language model is trained to recover tokens hidden within a sequence using both left and right context. This bidirectional training makes MLMs strong encoders for understanding tasks, but less natural than autoregressive models for direct generation.
- Causal language model: A causal language model predicts each token using only earlier tokens, enforced by a causal attention mask. It is essentially the same modeling family as an autoregressive language model, with the word 'causal' emphasizing the masking constraint in self-attention.
- Chat language model: A chat language model is a pretrained LLM further tuned to follow instructions and handle multi-turn dialogue. It is usually built by supervised fine-tuning plus preference optimization or RLHF, so it behaves more helpfully and safely than the raw base model.
- Backoff (N-gram backoff): Backoff is an n-gram smoothing strategy that uses a high-order estimate when it has enough evidence and otherwise falls back to a lower-order n-gram. It handles sparsity by preferring specific context when available without assigning zero probability to unseen sequences.
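A minimal sketch of the backoff idea in the "stupid backoff" style (a fixed discount rather than properly normalized discounting); the corpus and the discount constant are toy assumptions:

```python
from collections import Counter

# Backoff scoring sketch: use the bigram estimate when the bigram was
# observed, otherwise fall back to a discounted unigram estimate.

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def score(prev, word, discount=0.4):
    if bigrams[(prev, word)] > 0:
        # enough evidence: use the specific, high-order estimate
        return bigrams[(prev, word)] / unigrams[prev]
    # back off: discounted unigram, so unseen bigrams never get zero
    return discount * unigrams[word] / total

print(score("the", "cat"))   # seen bigram: bigram estimate
print(score("mat", "ran"))   # unseen bigram: small but nonzero
```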
- ROUGE: ROUGE is a family of overlap metrics for summarization and generation, based on matching n-grams, longest common subsequences, or skip-bigrams between a candidate and reference text. It measures lexical recall more than semantic faithfulness, so it is informative but limited.
- Kernel Methods: Kernel methods turn linear algorithms into nonlinear ones by replacing inner products with a kernel function that implicitly measures similarity in a higher-dimensional feature space. This is the core trick behind SVMs, kernel ridge regression, and Gaussian processes.
- Decoder Block: A decoder block is the basic unit of a decoder-only Transformer: causal self-attention plus a position-wise MLP, wrapped with residual connections and normalization. Stacking these blocks lets the model mix context across tokens while preserving autoregressive generation.
- Decoder-only Transformer: A decoder-only Transformer is a Transformer architecture composed only of masked self-attention blocks, so each token can attend only to earlier tokens. This makes it the standard architecture for autoregressive language models such as GPT, LLaMA, and Claude.
- Masked Attention Score: A masked attention score is an attention logit after adding a mask that blocks forbidden positions, typically by adding a very large negative value before softmax. This forces the resulting attention weight to be effectively zero at those positions.
- Attention Mask: An attention mask is a tensor that tells an attention layer which positions may interact and which must be blocked. It is used for causal generation, padding suppression, and task-specific visibility patterns, and it must be applied before softmax, not after.
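The two entries above fit in a few lines: add a large negative value to the forbidden logits, then softmax. The logits and the mask pattern below are toy values standing in for one query row of a causal attention map.

```python
import math

# Additive masking sketch: forbidden positions get a huge negative
# logit before softmax, so their attention weights are effectively zero.
NEG_INF = -1e9

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 3.0, 0.5]          # scores over 4 key positions
allowed = [True, True, False, False]   # causal mask for a query at position 1

masked = [x if ok else x + NEG_INF for x, ok in zip(logits, allowed)]
weights = softmax(masked)
print([round(w, 3) for w in weights])  # mass only on allowed positions
```

Masking after softmax instead would leave weights that no longer sum to one, which is why the mask must be applied to the logits.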
- Multi-Head Attention: Multi-head attention runs several attention mechanisms in parallel on different learned projections of the same input, then concatenates their outputs. This lets the model capture multiple relational patterns at once instead of forcing all interactions through a single attention map.
- Root Mean Square Normalization (RMSNorm): RMSNorm normalizes activations by their root mean square without subtracting the mean. Compared with LayerNorm it is slightly cheaper and often just as effective, which is why many modern LLMs use RMSNorm in place of full mean-and-variance normalization.
- System Prompt: A system prompt is a high-priority instruction block that defines the assistant's role, rules, and behavioral constraints for a conversation. It is usually prepended invisibly to user messages and is intended to override lower-priority prompt content.
- Prompting Format: Prompting format is the template used to serialize instructions, roles, examples, and conversation turns into the token sequence a model expects. It matters because the same words in a different format can change model behavior, especially for chat-tuned systems.
- Few-Shot Prompting: Few-shot prompting includes a small number of labeled examples in the prompt so the model can infer the task from context without updating parameters. It is one of the clearest demonstrations of in-context learning in large language models.
- In-Context Learning: In-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Prompt Engineering: Prompt engineering is the practice of designing prompts that make a model reliably produce the desired behavior. It includes choosing instructions, examples, structure, and reasoning scaffolds, and it trades parameter updates for careful interface design.
- Chain of Thought: Chain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into one token prediction.
- Self-Consistency: Self-consistency samples multiple reasoning traces for the same problem and chooses the most common final answer rather than trusting a single chain of thought. It often boosts accuracy because different samples make different mistakes, while the correct answer tends to recur.
- ReAct (Reason + Act): ReAct is a prompting pattern where a model alternates between reasoning in text and taking actions such as search or tool calls. This lets it use external information and observations to update its plan instead of reasoning only from the original prompt.
- Function Calling: Function calling is a language-model capability for producing structured tool invocations instead of only plain text. The model selects a function and arguments that match a schema, which makes tool use more reliable and easier to integrate with software systems.
- Large Language Model (LLM): A large language model is a very large neural language model, usually with billions of parameters, pretrained on massive text corpora. Scale gives LLMs broad world knowledge and emergent capabilities such as in-context learning, but the core training objective is still language modeling.
- Supervised Fine-Tuning (SFT): Supervised fine-tuning trains a pretrained model on curated input-output pairs so it follows instructions, styles, or task formats more reliably. In chat systems, SFT is the stage that turns a raw completion model into an assistant before preference alignment is applied.
- Full Fine-Tune: A full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
- Parameter-Efficient Fine-Tuning (PEFT): PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
- Low-Rank Adaptation (LoRA): LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
- LoRA Adapter: A LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
- Base Model: A base model is the pretrained model before instruction tuning, chat alignment, or task-specific fine-tuning. It is usually optimized only for language modeling, so it can complete text well but may not reliably follow user instructions or safety constraints.
- Open-Weight Model: An open-weight model is a model whose trained weights are publicly released for download and local use. That is more specific than 'open source': the weights may be open even when the training data, code, or full recipe are not.
- Sampling (in Language Models): Sampling in language models means selecting the next token from the predicted probability distribution instead of always taking the argmax. The decoding rule strongly shapes diversity, coherence, and repetition, which is why temperature, top-k, and top-p matter so much.
- Top-k Sampling: Top-k sampling truncates the next-token distribution to the \( k \) most probable tokens, renormalizes, and samples from that set. It removes the low-probability tail that often contains junk while still allowing controlled randomness.
- Top-p Sampling (Nucleus Sampling): Top-p, or nucleus, sampling chooses the smallest set of tokens whose cumulative probability exceeds a threshold \( p \), then samples from that adaptive set. Unlike top-k, it expands when the model is uncertain and shrinks when the distribution is sharp.
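The contrast between the two truncation rules is easy to show on one toy distribution (the token probabilities below are illustrative, not from a real model):

```python
# Top-k vs. top-p truncation sketch over a toy next-token distribution.

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zq": 0.05}

def top_k(dist, k):
    """Keep the k most probable tokens, then renormalize."""
    kept = dict(sorted(dist.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

def top_p(dist, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    kept, total = {}, 0.0
    for tok, pr in sorted(dist.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return {t: q / total for t, q in kept.items()}

print(top_k(probs, 2))    # fixed size: only "the" and "a" survive
print(top_p(probs, 0.9))  # adaptive size: keeps tokens until mass >= 0.9
```

On a sharper distribution the top-p set would shrink automatically, while top-k would still keep exactly k tokens; that adaptivity is the practical difference between the two.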
- Frequency Penalty: A frequency penalty subtracts an amount proportional to how many times a token has already appeared, lowering its future logit more with each repetition. It encourages lexical diversity without banning reuse entirely, which makes it gentler than hard repetition constraints.
- Presence Penalty: A presence penalty lowers the score of any token that has already appeared, encouraging the model to introduce new words or topics instead of repeating earlier ones. Unlike a frequency penalty, it depends only on whether the token has appeared at least once, not on how many times it appeared.
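The count-proportional versus at-least-once distinction can be sketched as one logit adjustment. The logit values and penalty strengths below are toy assumptions in the style of common sampling-API controls:

```python
from collections import Counter

# Frequency vs. presence penalty sketch applied to next-token logits.
# freq_pen scales with how often a token appeared; pres_pen is a flat
# deduction for any token that appeared at least once.

def penalize(logits, generated, freq_pen=0.0, pres_pen=0.0):
    counts = Counter(generated)
    out = {}
    for tok, logit in logits.items():
        c = counts[tok]
        out[tok] = logit - freq_pen * c - pres_pen * (1 if c > 0 else 0)
    return out

logits = {"the": 2.0, "cat": 1.0}
history = ["the", "the", "the"]  # "the" generated three times already

print(penalize(logits, history, freq_pen=0.5))  # 2.0 - 3 * 0.5 = 0.5
print(penalize(logits, history, pres_pen=0.5))  # 2.0 - 0.5 = 1.5
```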
- Hallucination: Hallucination is when a model produces content that is unsupported or false while presenting it as if it were correct. In language models it often comes from next-token training, weak grounding, or overconfident decoding rather than deliberate deception.
- Bias (Fairness): In fairness contexts, bias means systematic differences in treatment or error rates across groups caused by data, labels, measurement, or deployment choices. Fairness asks which notion of equal treatment matters, and different fairness criteria often cannot all be satisfied at once.
- Explainability: Explainability is the ability to give a human-understandable reason for a model’s prediction or behavior using features, examples, rules, or mechanisms. A good explanation should be useful to a person and, ideally, faithful to what the model actually used.
- Retrieval-Augmented Generation (RAG): Retrieval-augmented generation adds a retrieval step so the model conditions on external documents at inference time instead of relying only on memorized parameters. It can improve freshness and grounding, but answer quality depends heavily on retrieval recall, ranking, chunking, and how well the model uses the retrieved evidence.
- Semantic Search: Semantic search retrieves results by meaning rather than exact keyword overlap, usually by embedding queries and documents into a vector space and comparing similarity. It handles paraphrases well, but it is often combined with lexical search when exact terms or identifiers matter.
- Model Merging: Model merging combines the weights or weight deltas of separately fine-tuned models into one model without full retraining. It can create multitask behavior cheaply, but naive averaging often causes interference unless the merged models share a common base and compatible parameter geometry.
- Model Soups: Model soups average the weights of multiple fine-tuned models that lie in the same low-loss basin, often improving accuracy and robustness without extra inference cost. Unlike an ensemble, a soup is still a single model at serving time.
- SLERP (Spherical Linear Interpolation): SLERP interpolates between two parameter vectors along the great-circle path on a sphere instead of using straight-line interpolation. In model merging it can preserve vector norms and sometimes produce smoother blends than linear interpolation when the two directions differ strongly.
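The norm-preservation point can be checked with the standard slerp formula on two toy orthogonal vectors; the linear-interpolation fallback for nearly parallel inputs is a common practical guard, not part of the formula itself:

```python
import math

# SLERP sketch: interpolate along the great circle between two vectors.

def slerp(u, v, t):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    omega = math.acos(max(-1.0, min(1.0, dot / (nu * nv))))  # angle between
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(u, v)]
    s = math.sin(omega)
    c1 = math.sin((1 - t) * omega) / s
    c2 = math.sin(t * omega) / s
    return [c1 * a + c2 * b for a, b in zip(u, v)]

u, v = [1.0, 0.0], [0.0, 1.0]      # toy orthogonal unit "parameter" vectors
mid = slerp(u, v, 0.5)
norm = math.sqrt(sum(x * x for x in mid))
print([round(x, 4) for x in mid], round(norm, 4))
```

Straight-line interpolation between the same two unit vectors would give a midpoint of norm about 0.707, which is the shrinkage slerp avoids.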
- TIES-Merging: TIES-Merging is a model-merging method that reduces interference by trimming small parameter changes, resolving sign conflicts, and merging only updates aligned with an agreed sign. It is designed for cases where separately fine-tuned models disagree on which weights should move and in what direction.
- DARE (Drop And REscale): DARE sparsifies fine-tuning deltas by randomly dropping many update entries and rescaling the rest, preserving most behavior while greatly reducing storage and merge interference. It is commonly used as a preprocessing step for delta compression or model merging rather than as a standalone training method.
- Task Vector: A task vector is the weight difference between a pre-trained model and the same model after fine-tuning on a task. Adding, subtracting, or scaling that vector can steer behavior, so task vectors provide a simple weight-space tool for editing or combining capabilities.
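Task-vector arithmetic is just elementwise subtraction and addition over the flattened weights. A minimal sketch with toy weight values:

```python
# Task-vector sketch: the vector is fine-tuned weights minus base
# weights; scaling and adding it steers behavior in weight space.

base      = [0.10, -0.20, 0.30]
finetuned = [0.15, -0.25, 0.50]

task_vector = [f - b for f, b in zip(finetuned, base)]

def apply(weights, vec, alpha=1.0):
    """Add a scaled task vector to a weight list."""
    return [w + alpha * d for w, d in zip(weights, vec)]

print(apply(base, task_vector, alpha=1.0))   # recovers the fine-tuned weights
print(apply(base, task_vector, alpha=-1.0))  # moves away from the task instead
```

The same arithmetic extends to combining capabilities: summing several task vectors before adding them to the base is the weight-space analogue of multitask merging.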
- Knowledge Distillation: Knowledge distillation trains a smaller student model to match the outputs or internal representations of a larger teacher. It transfers some of the teacher’s behavior into a cheaper model, often using soft targets that contain more information than hard labels alone.
- Pruning: Pruning removes weights, neurons, heads, or entire blocks that contribute little to performance. It can reduce model size or compute, but aggressive pruning usually needs fine-tuning to recover lost accuracy.
- Structured Pruning: Structured pruning removes whole channels, heads, layers, or blocks, producing regular sparsity that hardware can exploit directly. It usually yields better real-world speedups than unstructured pruning, though it gives less fine-grained control.
- Unstructured Pruning: Unstructured pruning zeros individual weights wherever they appear unimportant, creating irregular sparsity patterns. It can achieve high compression ratios, but specialized kernels are often needed to turn that sparsity into real latency gains.
- Post-Training Quantization (PTQ): Post-training quantization converts a trained model to lower precision after optimization is finished, usually using a calibration set to estimate activation ranges. It is easy and cheap to apply, but accuracy can drop more than with quantization-aware training at very low bit widths.
- Quantization-Aware Training (QAT): Quantization-aware training simulates low-precision arithmetic during training so the model learns weights that remain accurate after quantization. It usually outperforms post-training quantization at low bit widths, but it adds training cost and implementation complexity.
- Reward Model: A reward model predicts a scalar preference score for a candidate response, usually from pairwise human comparisons. In RLHF it acts as a learned proxy objective, so the policy can exploit its mistakes if optimization pushes too hard against it.
- Self-Critique: Self-critique is a prompting or training pattern where a model reviews its own draft, identifies problems, and then revises the answer. It can improve reasoning and safety, but only when the model can recognize errors more reliably than it makes them.
- Ranking: Ranking is the task of ordering items by relevance, preference, or utility rather than predicting a single class label. It appears in search, recommendation, and alignment because the main question is which outputs should be placed above others.
- Pairwise Ranking: Pairwise ranking learns an ordering from relative preferences between pairs rather than from absolute target values. Many ranking losses optimize the probability that preferred items score above rejected ones, which fits search and alignment data naturally.
- Cross-Attention: Cross-attention lets one sequence or modality attend to representations produced by another sequence or modality. In encoder-decoder models the decoder queries encoder states, and in multimodal models text tokens often query visual features the same way.
- Vision Encoder: A vision encoder maps an image into features or tokens that downstream modules can use for classification, retrieval, or generation. CNNs and Vision Transformers are common vision encoders, differing mainly in how they represent spatial structure.
- CLIP (Contrastive Language-Image Pre-training): CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels.
- Program-Aided Language Model: A program-aided language model uses the LLM to translate a problem into executable code, then lets an interpreter carry out the exact computation. This separates natural-language understanding from symbolic execution and often improves arithmetic or algorithmic reasoning over pure chain-of-thought.
- Data Parallelism: Data parallelism replicates the model on multiple devices and splits each batch across them, synchronizing gradients after each step. It is the simplest way to scale training throughput, but every device still stores the full model unless sharding is added.
- Model Parallelism: Model parallelism splits one model across multiple devices because it is too large or compute-heavy for a single device. The split can happen by layers, tensors, experts, or sequence chunks, trading memory savings for extra communication.
- Model Sharding: Model sharding splits a model’s parameters across devices or storage tiers instead of keeping a full copy everywhere. It is a general systems technique used in tensor parallelism, FSDP, offloading, and large-model serving to reduce per-device memory requirements.
- Floating-Point Operations (FLOPs): FLOPs count the number of floating-point arithmetic operations required by a model or workload. They are a useful compute proxy for comparing training or inference cost, though real speed also depends on memory traffic, parallelism, and hardware utilization.
- Mixed Precision Training: Mixed-precision training performs most computation in lower precision such as FP16 or bfloat16 while keeping selected quantities in higher precision for stability. It reduces memory use and often increases throughput without much accuracy loss when implemented carefully.
- Needle in a Haystack: Needle in a Haystack is a long-context benchmark that tests whether a model can retrieve a small target fact embedded inside a large distractor context. It is useful for measuring position-sensitive retrieval, but strong needle scores do not guarantee broader long-document reasoning.
- Relative Position Encoding: Relative position encoding represents how far apart tokens are rather than assigning each position a standalone ID. That lets attention depend on distance or offset, which often improves length generalization and transfers patterns more naturally across positions.
- Attention Mechanism: Attention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector.
- Encoder-Decoder Architecture: An encoder-decoder architecture uses an encoder to turn an input sequence into representations and a decoder to generate an output sequence conditioned on those representations. It is the standard design for translation, summarization, and other input-to-output generation tasks.
- Nearest Neighbor Search: Nearest neighbor search finds the stored vectors most similar to a query under a chosen distance or similarity metric. Exact search is simple but expensive at scale, so large systems often use approximate nearest neighbor indexes instead.
- Vector Database: A vector database is a system optimized for storing embeddings and retrieving nearest neighbors together with metadata filtering, updates, and persistence. It is the common serving layer behind semantic search and many RAG systems.
- Dense Retrieval: Dense retrieval represents queries and documents with learned dense embeddings and retrieves by vector similarity. It handles paraphrase and semantic matching better than sparse retrieval, but it can miss exact lexical constraints and usually relies on approximate nearest neighbor search.
- Factuality: Factuality is whether the content of an answer is actually true in the world or according to trusted sources. An answer can be fluent and even faithful to its source while still being nonfactual if the source itself is wrong or outdated.
- Calibration: Calibration measures whether predicted probabilities match observed frequencies, so events predicted at 70% should occur about 70% of the time. A model can be accurate but poorly calibrated if its confidence is systematically too high or too low.
- Temperature Scaling: Temperature scaling calibrates a classifier by dividing logits by a learned scalar temperature before the softmax. It often improves probability calibration on a validation set without changing the model’s ranking of classes.
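Both properties of temperature scaling, softening confidence while preserving the class ranking, are visible in a few lines. The logits and the temperature value are toy assumptions (in practice T is fit on a validation set):

```python
import math

# Temperature scaling sketch: divide logits by T before softmax.
# T > 1 flattens the distribution; the argmax ordering is unchanged.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def scaled_probs(logits, T):
    return softmax([x / T for x in logits])

logits = [3.0, 1.0, 0.0]
p1 = scaled_probs(logits, T=1.0)
p2 = scaled_probs(logits, T=2.0)
print(round(p1[0], 3), round(p2[0], 3))  # top probability shrinks at T=2
```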
- Batch Normalization: Batch normalization normalizes activations using mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference because it relies on running statistics.
- Layer Normalization: Layer normalization normalizes activations across features within each example rather than across the batch. It works well for variable-length sequences and small batch sizes, which is why it is standard in Transformers.
- Gradient Clipping: Gradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
- Weight Decay: Weight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but in adaptive optimizers the decoupled AdamW form is usually preferred.
- RMSProp: RMSProp uses an exponential moving average of squared gradients to normalize updates, preventing the denominator from growing without bound as in AdaGrad. It is useful for nonstationary problems and was a key precursor to Adam.
- Cosine Annealing: Cosine annealing decays the learning rate following a cosine curve from a high value to a low value, sometimes with restarts. It provides a smooth schedule that often works well in practice without needing many hand-tuned decay boundaries.
- He Initialization (Kaiming Initialization)He initialization sets weight variance to roughly 2/fan-in so ReLU-like activations preserve signal magnitude through depth. It improves on Xavier initialization for one-sided activations that zero out about half the inputs.
- Label SmoothingLabel smoothing replaces hard one-hot targets with a mostly-correct probability distribution that assigns a small amount of mass to other classes. It regularizes overconfident classifiers and often improves generalization and calibration, though it can hurt when exact probabilities matter.
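The target transformation is a one-liner; a sketch with an illustrative smoothing factor:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    # One-hot targets with eps mass spread uniformly over all K classes:
    # the correct class gets 1 - eps + eps/K, every other class gets eps/K.
    onehot = np.eye(num_classes)[y]
    return onehot * (1.0 - eps) + eps / num_classes

t = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
# correct class: 0.9 + 0.1/4 = 0.925; each other class: 0.1/4 = 0.025
```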
- Instruction TuningInstruction tuning is supervised fine-tuning on instruction-response examples so a pretrained model learns to follow requests instead of merely continuing text. It improves task generality and usability, but it mainly changes behavior and format-following rather than adding much new world knowledge.
- Red-TeamingRed-teaming is adversarial evaluation in which people or automated systems deliberately try to break a model’s safeguards and expose failure modes. Its purpose is not to improve benchmark scores directly, but to find unsafe or brittle behavior before deployment.
- BLEUBLEU is a machine-translation metric based mainly on n-gram precision against one or more reference texts, combined with a brevity penalty. It is useful for corpus-level comparison, but it often misses meaning-preserving paraphrases and is weak as a sentence-level quality measure.
- A/B Testing (ML Systems)A/B testing in ML systems is a randomized online experiment that serves different model variants to different user groups and compares outcome metrics. It is the standard way to measure real production impact, because offline wins do not always translate into better user experience.
- Continual LearningContinual learning is the problem of learning from a sequence of tasks or data distributions without losing previously acquired capabilities. Its core challenge is the stability-plasticity tradeoff: the model must remain adaptable to new data without catastrophically overwriting old knowledge.
- Curriculum LearningCurriculum learning trains a model on examples in an organized order, usually from easier or more structured cases to harder ones. The idea is to improve optimization and generalization by shaping the training distribution, though a bad curriculum can also slow learning or bias the model.
- Instruction DatasetAn instruction dataset is a curated set of prompts paired with preferred responses used to teach a pretrained model how to behave as an assistant. It mainly teaches task framing, format, and interaction style rather than the broad world model that comes from pretraining.
- Preference DatasetA preference dataset contains prompts with ranked, paired, or binary-labeled responses indicating which outputs are preferred. It is the standard supervision source for reward modeling and direct preference objectives because it expresses comparative quality better than a single gold answer.
- Value FunctionA value function estimates expected future return, either from a state or from a state-action pair under a policy. It matters because it turns delayed rewards into local training signals, enabling planning, bootstrapping, and lower-variance policy gradients.
- Attention VisualizationAttention visualization renders attention weights as heatmaps or token-to-token graphs so we can see which positions a model attends to. It is a useful diagnostic tool, but attention weights alone are not a complete explanation of what the model is computing.
- Probing (Neural Networks)Probing tests whether information is encoded in a model’s hidden states by training a simple classifier or regressor on those representations. A successful probe shows that the information is recoverable, but not necessarily that the model causally uses it.
- Activation AnalysisActivation analysis studies the intermediate activations produced during a forward pass rather than only the model’s static weights. By examining which neurons, channels, or directions fire in different contexts, it helps connect internal representations to model behavior.
- What are emergent capabilities in large language models?Emergent capabilities in large language models are abilities that look weak or absent at small scale but become strong once the model is large enough. The key caveat is that “emergence” can depend on the metric and threshold used, so apparent jumps are not always literal discontinuities in the underlying capability.
- Zero-Shot LearningZero-shot learning is the ability to perform a task from a description alone, without task-specific training examples in the prompt or fine-tuning data. In LLMs it is a direct consequence of broad pretraining and instruction-following ability.
- One-Shot LearningOne-shot learning is the ability to learn or generalize from a single labeled example or demonstration. It matters because many real tasks do not provide large datasets, so the model must infer the rule from minimal evidence.
- Active LearningActive learning is a training strategy that selectively asks for labels on the most informative unlabeled examples instead of labeling data uniformly at random. Its purpose is to reduce annotation cost by spending human effort where uncertainty or disagreement is highest.
- Classification HeadA classification head is the final task-specific layer that maps learned representations to class logits or probabilities. In transfer learning it is often the only part trained from scratch, while the backbone provides reusable features.
- Information RetrievalInformation retrieval is the problem of finding and ranking the documents, passages, or items most relevant to a query. Modern systems combine lexical matching, learned embeddings, and ranking models because exact term overlap and semantic similarity each capture different kinds of relevance.
- Memory Optimization (ML Training)Memory optimization in ML training is the collection of techniques that reduce peak memory so larger models or batches fit on available hardware. Common examples are mixed precision, activation checkpointing, optimizer sharding, offloading, and more memory-efficient attention kernels.
- Role-Playing (LLMs)Role-playing in LLMs means conditioning a model to adopt a persona, voice, or behavioral frame during generation. It is useful for simulation and product design, but it also shows how easily high-level behavior can be steered by prompt context.
- Model EvaluationModel evaluation is the systematic measurement of how a model performs, fails, and trades off across tasks, metrics, and deployment contexts. Good evaluation combines offline benchmarks, stress tests, human judgment, and online metrics rather than relying on a single score.
- Shadow DeploymentShadow deployment runs a new model in production alongside the live system without letting its outputs affect users. This makes it possible to compare latency, quality, and failure modes on real traffic before committing to a risky rollout.
- Feedback Loop (ML Systems)A feedback loop in an ML system occurs when the model’s outputs change the data it will later train on or be evaluated against. These loops can reinforce bias, distort demand, and make offline metrics look better even while the real system gets worse.
- Data MixtureA data mixture is the weighting and composition of different datasets or domains in a training run. It matters because capability, robustness, and bias often depend as much on what proportion of the data comes from each source as on the total token count.
- Steering VectorsSteering vectors are directions in activation space that, when added to hidden states, systematically change model behavior toward traits such as refusal, sentiment, or persona. They are useful because they show that some behaviors can be modified directly in representation space without full retraining.
- Expectation–Maximization (EM) AlgorithmThe EM algorithm is an iterative method for maximum-likelihood or MAP estimation in models with latent variables. Each round first estimates the hidden structure under the current parameters and then re-optimizes the parameters as if that hidden structure were known.
- Pretraining Data DeduplicationPretraining data deduplication removes near-duplicate documents or passages from a training corpus. It improves per-token efficiency and reduces memorization and benchmark contamination, because repeatedly seeing the same text usually wastes compute more than it adds knowledge.
- REINFORCE AlgorithmREINFORCE is the basic Monte Carlo policy-gradient algorithm that updates parameters by weighting the log-probability of sampled actions by their returns. It is unbiased, but its variance is high, which is why practical methods usually add baselines or critics.
- Actor–Critic MethodsActor-critic methods learn a policy and a value estimator at the same time. The actor chooses actions, while the critic estimates return and supplies a lower-variance learning signal than raw returns alone.
- Self-Refine / ReflexionTwo closely related inference-time techniques in which an LLM critiques and revises its own output over multiple rounds. Self-Refine uses the same model in three roles (generate → feedback → refine); Reflexion adds an episodic memory of past failures to guide future trajectories in agentic tasks.
- Core LLM Benchmarks (MMLU, HumanEval, GSM8K, MATH)MMLU tests broad academic knowledge, HumanEval tests code generation checked by unit tests, GSM8K tests grade-school math word problems, and MATH tests harder competition-style mathematics. Together they cover knowledge, code, and reasoning, but all can be gamed or saturated, so they are only a partial view of model quality.
- LMSYS Chatbot ArenaLMSYS Chatbot Arena is a crowdsourced pairwise-evaluation platform where users compare two anonymous models by chatting and voting. Its Elo-style ranking captures interactive preference better than a single benchmark, but it is noisy and sensitive to traffic mix and prompt selection.
- Mahalanobis DistanceThe metric \( d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})} \) measures distance in a space whitened by the covariance \( \Sigma \). It is the natural distance for Gaussian data, the log-likelihood core of a multivariate Gaussian, and the basis of Fisher's discriminant analysis.
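A small numpy illustration: along a high-variance axis, the same Euclidean displacement counts as a shorter Mahalanobis distance (the covariance values are made up for the example):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    # d = sqrt((x - mu)^T Sigma^{-1} (x - mu)); solve instead of inverting Sigma.
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

mu = np.zeros(2)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])        # the x-axis has four times the variance
x = np.array([2.0, 0.0])
d = mahalanobis(x, mu, Sigma)         # Euclidean distance 2, Mahalanobis distance 1
```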
- Conjugate PriorsA prior is conjugate to a likelihood family if the posterior belongs to the same family as the prior. Conjugacy turns Bayesian updating into a closed-form parameter update and underlies analytical treatments of Beta–Binomial, Dirichlet–Multinomial, and Normal–Normal models.
- Gaussian Mixture Models (GMM)A density model \( p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k) \) that represents data as a weighted sum of Gaussian components. Fitted by the EM algorithm, GMMs are the canonical example of latent-variable density estimation and the statistical cousin of \( k \)-means.
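A compact EM sketch for a two-component 1-D mixture on synthetic data (the initialisation heuristic and all constants are illustrative choices, not a canonical recipe):

```python
import numpy as np

def em_gmm_1d(x, iters=200):
    # EM for a two-component 1-D Gaussian mixture.
    mu = np.array([x.min(), x.max()])        # crude but effective initialisation
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities r[n, k] proportional to pi_k N(x_n; mu_k, var_k)
        d = x[:, None] - mu[None, :]
        logp = -0.5 * (d ** 2 / var + np.log(2 * np.pi * var)) + np.log(pi)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters as responsibility-weighted statistics
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])
pi, mu, var = em_gmm_1d(x)
```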
- Variational Inference / ELBOA framework that turns Bayesian inference into optimisation: choose a tractable family \( q(\mathbf{z}; \phi) \) and maximise the evidence lower bound \( \mathcal{L}(\phi) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \), which simultaneously approximates the posterior and bounds the log-evidence. VAEs, variational Bayes, and amortised inference all descend from this objective.
- Contrastive Learning (SimCLR / MoCo)Self-supervised visual representation learning via the InfoNCE loss: pull together two augmented views of the same image (positives) while pushing apart views of all other images (negatives). SimCLR uses in-batch negatives; MoCo uses a queued momentum encoder, enabling large effective negative pools with small batches.
- Whisper (Speech-to-Text)OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, and voice-activity / language ID from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- Residual Networks (ResNet as Architecture)A residual network replaces a plain layer stack with blocks that learn a residual update \(F(x)\) and add it back to the input, so each block computes \(y = x + F(x)\). This makes very deep CNNs much easier to optimize and triggered the shift from VGG-style stacks to residual architectures.
- Transformer-XL / Segment-Level RecurrenceDai et al. (2019) extend Transformers beyond fixed context by caching hidden states of the previous segment and allowing attention to read from them — a simple "segment-level recurrence" that gives an effective receptive field of \( N \cdot L \) for \( L \) layers and segment length \( N \). Paired with relative positional encoding, it was a key bridge between pure attention and long-context models.
- Longformer / BigBird (Sparse Long-Context Attention)Fixed sparsity patterns that reduce attention from \( O(n^2) \) to \( O(n) \) for long documents. Longformer combines sliding-window + global attention; BigBird adds random attention and proves the result retains full-attention universal-approximation properties. Both were pre-2022 answers to scaling Transformers to 4k–16k tokens.
- Rejection Sampling Fine-Tuning (RFT)An offline alignment method: sample many completions from a base model, keep only those that pass a quality / reward filter, and fine-tune on the survivors. RFT is the simplest way to convert a reward model (or a verifier) into improved policy behaviour, and serves as a strong baseline against DPO / PPO.
- Best-of-N Sampling and its ScalingAn inference-time boost: sample \( N \) responses from a language model, score them with a reward model or verifier, and return the best. Quality improves roughly with \( \log N \); scaling laws predict when spending best-of-\( N \) inference compute is worth as much as extra training compute, and motivate distilling best-of-\( N \) behaviour back into the policy via RFT.
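The whole procedure is a few lines once sampling and scoring are abstracted; in this toy sketch, "responses" are numbers and the stand-in reward prefers values near 0.5 (both stand-ins are invented for illustration):

```python
import random

def best_of_n(sample, reward, n, rng):
    # Draw n candidate responses and return the one the scorer ranks highest.
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

sample = lambda r: r.random()       # stand-in for sampling from a model
reward = lambda x: -abs(x - 0.5)    # stand-in reward model / verifier
best = best_of_n(sample, reward, n=256, rng=random.Random(0))
```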
- Feature Attribution (Integrated Gradients, SHAP)Post-hoc interpretability methods that attribute a model's prediction to its input features. Integrated Gradients (Sundararajan et al., 2017) integrates gradients along a path from a baseline to the input. SHAP (Lundberg & Lee, 2017) builds on Shapley values from cooperative game theory, giving a unique attribution with axiomatic fairness properties.
- Tool Use / Function Calling Benchmarks (BFCL, τ-bench)BFCL (Berkeley Function Calling Leaderboard, 2024) and τ-bench (2024) evaluate LLMs' ability to select, parameterise, and sequence API calls. BFCL is single-turn function-call accuracy; τ-bench is multi-turn dialogue in simulated customer-service environments with realistic state and policy constraints.
- Moore–Penrose PseudoinverseFor any \( A \in \mathbb{R}^{m \times n} \) with SVD \( A = U\Sigma V^\top \), the Moore–Penrose pseudoinverse is \( A^+ = V\Sigma^+ U^\top \), where \( \Sigma^+ \) inverts the non-zero singular values and transposes. \( A^+ b \) is the minimum-norm least-squares solution to \( Ax = b \) — the canonical generalisation of \( A^{-1} \) for rectangular and rank-deficient matrices.
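A rank-deficient example with `np.linalg.pinv` (the matrix is constructed so the answer can be checked by hand):

```python
import numpy as np

# Rank-deficient least squares: np.linalg.pinv computes A+ from the SVD,
# and A+ b is the minimum-norm solution among all least-squares solutions.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])      # rank 1: the second column is twice the first
b = np.array([1.0, 2.0, 3.0])
A_pinv = np.linalg.pinv(A)
x = A_pinv @ b                  # minimum-norm least-squares solution
```

Here \( b \) lies in the column space of \( A \), so the residual is zero and \( x \) is the shortest of the infinitely many exact solutions (it lies in the row space of \( A \)).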
- QR DecompositionAny \( A \in \mathbb{R}^{m \times n} \) with \( m \ge n \) factors as \( A = QR \), with \( Q \in \mathbb{R}^{m \times n} \) having orthonormal columns and \( R \in \mathbb{R}^{n \times n} \) upper triangular. It is the numerically preferred route to OLS, avoids squaring the condition number, and is the work-horse of least-squares solvers in every major numerical library.
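A sketch of OLS via thin QR on synthetic data (the design matrix, true coefficients, and noise level are invented for the example):

```python
import numpy as np

# With A = QR, the normal equations reduce to R x = Q^T b,
# a small triangular system solved without ever forming A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(100)

Q, R = np.linalg.qr(A)          # Q: 100x3 orthonormal columns, R: 3x3 upper triangular
x = np.linalg.solve(R, Q.T @ b)
```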
- Cholesky DecompositionEvery symmetric positive-definite matrix \( A \) factors uniquely as \( A = LL^\top \) with \( L \) lower triangular with positive diagonal. Cholesky is twice as fast as LU, numerically stable without pivoting, and the standard building block for Gaussian-process inference, Kalman filters, and sampling from multivariate normals.
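The multivariate-normal sampling use case in a few numpy lines (the mean and covariance are illustrative):

```python
import numpy as np

# Sampling from N(mu, Sigma): if Sigma = L L^T and z ~ N(0, I),
# then mu + L z has covariance L I L^T = Sigma.
rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)            # lower triangular, positive diagonal
z = rng.standard_normal((100_000, 2))
samples = mu + z @ L.T                   # each row is one draw from N(mu, Sigma)
```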
- Matrix Calculus and DifferentialsMatrix calculus expresses derivatives of scalar, vector, and matrix-valued functions of matrix inputs in compact form, avoiding entrywise index chasing. Working with the differential \( dL = \text{tr}(G^\top dX) \) identifies \( \nabla_X L = G \) directly and turns backpropagation into linear algebra.
- Convex Optimization FundamentalsA convex optimization problem minimises a convex function over a convex feasible set. Its defining property: every local minimum is global. Convexity also gives polynomial-time algorithms with strong guarantees — the backdrop against which non-convex deep learning is understood as a deliberate departure.
- KKT Conditions and Lagrangian DualityThe Karush–Kuhn–Tucker (KKT) conditions generalise \( \nabla f = 0 \) to constrained optimisation, giving necessary (and, for convex problems, sufficient) conditions for a minimum. Lagrangian duality transforms the constrained primal into a dual over multipliers; for convex problems with Slater's condition, the duality gap closes — the workhorse derivation behind SVMs, interior-point methods, and much of RL.
- Newton's Method and Quasi-Newton (L-BFGS)Newton's method uses the second-order model \( f(x+\Delta) \approx f(x) + g^\top \Delta + \tfrac{1}{2}\Delta^\top H \Delta \) to take the step \( \Delta = -H^{-1} g \). It converges quadratically near a minimum but is impractical in high dimensions because each step needs a Hessian solve. Quasi-Newton methods (BFGS, L-BFGS) approximate \( H^{-1} \) from gradient histories, retaining super-linear convergence without ever forming or inverting the Hessian.
- Proximal Gradient Methods (ISTA / FISTA)Proximal gradient methods solve objectives of the form \(f(x)+g(x)\) where \(f\) is smooth and \(g\) is nonsmooth but has an easy proximal operator. Each step does gradient descent on \(f\) and then a shrinkage- or projection-like update for \(g\), which is why the method underlies Lasso, sparse coding, and constrained convex optimization.
- Conjugate Gradient MethodConjugate gradient (CG) solves \( Ax = b \) for symmetric PD \( A \) using only matrix-vector products, converging in at most \( n \) steps in exact arithmetic and much faster when the eigenvalue spectrum is clustered. CG underlies truncated-Newton optimisation, the natural-gradient in K-FAC, and large-scale Gaussian-process inference.
- Beta and Dirichlet DistributionsThe Beta \( (\alpha, \beta) \) is the conjugate prior for Bernoulli/Binomial likelihoods; the Dirichlet \( (\boldsymbol{\alpha}) \) is its \( K \)-class generalisation, conjugate to Categorical/Multinomial. Both live on probability simplices and make Bayesian updating a matter of adding counts to hyperparameters — the cleanest introduction to conjugate Bayesian inference and the scaffolding for LDA.
- Poisson and Exponential DistributionsPoisson \( (\lambda) \) models counts of independent events in a fixed interval; Exponential \( (\lambda) \) models the waiting time between them. They are the count and inter-arrival-time distributions of the same Poisson process, one discrete and one continuous, and between them cover rare-event counts, waiting times, and the building blocks of survival analysis, rate modelling, and queueing theory.
- Mutual Information and Conditional EntropyConditional entropy \( H(Y \mid X) \) measures the residual uncertainty in \( Y \) given \( X \). Mutual information \( I(X; Y) = H(Y) - H(Y \mid X) \) measures how much \( X \) reduces the uncertainty in \( Y \) — equivalently, the KL divergence from the joint \( p(X,Y) \) to the product of marginals \( p(X)p(Y) \). Together they quantify dependence, drive information-bottleneck theory, and define decision-tree splits.
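The KL-divergence form can be computed directly from a joint probability table; the two toy joints below are the extreme cases of independence and perfect dependence:

```python
import numpy as np

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ], in nats.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0                       # 0 log 0 contributes nothing
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

indep = np.outer([0.5, 0.5], [0.5, 0.5])        # X and Y independent: I = 0
copy  = np.array([[0.5, 0.0],
                  [0.0, 0.5]])                  # Y is a copy of X: I = H(X) = log 2
```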
- Hypothesis Testing, p-values, and Statistical PowerA hypothesis test compares a null \( H_0 \) to an alternative \( H_1 \) by computing a test statistic and its tail probability under \( H_0 \) — the p-value. Statistical power is \( 1 - \beta \), the probability of rejecting \( H_0 \) when \( H_1 \) is true. For ML evaluation, these are the tools that separate "this model is better" from "this split was lucky".
- Goodhart's Law and Specification GamingGoodhart's Law says that once a measure becomes a target, optimizing it can break its usefulness as a proxy. In AI safety and ML systems this appears as specification gaming: the model finds ways to maximize the metric without achieving the intended goal.
- t-SNE (t-Distributed Stochastic Neighbor Embedding)t-SNE embeds high-dimensional points into 2D/3D by matching a heavy-tailed joint \( Q \) in the embedding space to a Gaussian-kernel joint \( P \) over neighbourhoods in the input space, minimising \( D_{\text{KL}}(P\|Q) \). The heavy Student-t tail in \( Q \) solves the crowding problem; perplexity tunes neighbourhood size. t-SNE preserves local structure well but distorts global distances.
- UMAP (Uniform Manifold Approximation and Projection)UMAP constructs a fuzzy topological representation of the data using local Riemannian metrics, then finds a low-dimensional embedding whose fuzzy topology matches. Its cross-entropy objective has distinct attractive and repulsive terms — easier to scale than t-SNE — and a theoretical motivation grounded in category theory and Riemannian geometry. In practice, UMAP preserves more global structure than t-SNE at similar or better speed.
- DBSCAN (Density-Based Clustering)DBSCAN clusters points by density: a point is a core point if it has at least \( \text{minPts} \) neighbours within radius \( \varepsilon \); clusters are connected components of core-point neighbourhoods. Non-core points without core neighbours are labelled noise. Unlike \( k \)-means, DBSCAN does not need the number of clusters in advance and handles arbitrary cluster shapes, at the cost of \( \varepsilon \) tuning.
- Linear and Quadratic Discriminant Analysis (LDA / QDA)LDA and QDA are generative classifiers: model class-conditionals \( p(x \mid y = k) = \mathcal{N}(\mu_k, \Sigma_k) \) and apply Bayes' rule. LDA assumes shared covariance \( \Sigma \) across classes, giving linear decision boundaries and shrinkage-like regularisation; QDA uses per-class \( \Sigma_k \), giving quadratic boundaries at the cost of \( K \times d^2 \) parameters. Both are the generative counterparts of logistic regression.
- Matrix Factorisation and Collaborative FilteringMatrix factorisation decomposes a sparse user–item rating matrix \( R \approx UV^\top \) into low-rank user and item embeddings, learned by minimising squared error on observed entries with regularisation. This is the canonical collaborative-filtering recipe behind Netflix-style recommenders, the spiritual ancestor of modern embedding-based retrieval, and a clean worked example of low-rank learning.
- Multi-Armed Bandits: ε-Greedy, UCB, Thompson SamplingA multi-armed bandit is the simplest setting for exploration vs exploitation: \( K \) arms, each with unknown reward distribution; pull one per step; minimise cumulative regret. The three canonical strategies — ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling — illustrate optimism, confidence-based exploration, and probability matching respectively, and generalise to contextual bandits and full RL.
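A sketch of ε-greedy and UCB1 on a two-arm Bernoulli bandit (the arm probabilities and horizon are invented for the example; Thompson sampling is omitted for brevity):

```python
import math
import random

def eps_greedy(values, eps, rng):
    # With probability eps explore a random arm, otherwise exploit the best estimate.
    if rng.random() < eps:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(values, counts, t):
    # Optimism in the face of uncertainty: a confidence bonus that
    # shrinks as an arm accumulates pulls.
    for a, n in enumerate(counts):
        if n == 0:
            return a                        # pull every arm once first
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

# Toy run: two Bernoulli arms with success rates 0.2 and 0.8.
rng = random.Random(0)
probs = [0.2, 0.8]
counts, values = [0, 0], [0.0, 0.0]
for t in range(1, 2001):
    a = ucb1(values, counts, t)
    r = 1.0 if rng.random() < probs[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean of observed rewards
```

After the run, UCB has concentrated most pulls on the better arm while still occasionally revisiting the worse one, which is exactly the logarithmic-regret behaviour the entry describes.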
- Temporal-Difference Learning: TD(0), SARSA, TD(λ)Temporal-difference learning updates value estimates using the Bellman bootstrapped target \( R_t + \gamma V(S_{t+1}) \) rather than a full Monte Carlo return. TD(0) is the one-step instance; SARSA extends it on-policy to action-values; TD(λ) interpolates between TD(0) and Monte Carlo via eligibility traces. TD is the central learning rule behind Q-learning, DQN, and actor-critic.
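The bootstrapped target is easiest to see on a tiny deterministic chain, where the fixed point can be verified by hand (the chain and step size are invented for the example):

```python
# TD(0) on a two-state Markov reward process:
# s0 --(r=0)--> s1 --(r=1)--> terminal, with gamma = 1,
# so the true values are V(s0) = V(s1) = 1.
gamma, alpha = 1.0, 0.05
V = [0.0, 0.0]
for _ in range(2000):                              # repeat the episode many times
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])    # bootstrapped target R + gamma V(s')
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])     # terminal state has value 0
```

Note that \( V(s_0) \) improves only as \( V(s_1) \) does: the reward at the end of the chain propagates backwards one bootstrap step at a time, which is the defining behaviour of TD methods.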
- BERT: Bidirectional Encoder PretrainingBERT (Devlin et al., 2019) is a Transformer encoder pretrained with masked language modelling (MLM) and next-sentence prediction (NSP), producing bidirectional contextual representations. Unlike GPT's causal, left-to-right pretraining, MLM sees both past and future tokens, making BERT the go-to encoder for classification, span extraction, and sentence embeddings. Base/Large variants (110M/340M params) dominated GLUE/SQuAD until decoder-only LLMs took over.
- WordPiece, Unigram, and SentencePiece TokenisationWordPiece, Unigram, and SentencePiece are subword tokenization schemes used to balance vocabulary size against sequence length. WordPiece builds a vocabulary greedily, Unigram prunes a probabilistic vocabulary by likelihood, and SentencePiece is the language-agnostic toolkit that trains and applies these methods on raw text.
- Adapter Layers (Houlsby-style)Houlsby-style adapters are small bottleneck MLPs inserted inside each Transformer block while the original backbone is frozen. They were the first clean PEFT recipe: add a few trainable parameters per layer, keep the base model unchanged, and specialize the model to new tasks cheaply.
- GPU Memory Hierarchy and the Roofline ModelThe GPU memory hierarchy ranges from slow, large global memory to faster on-chip caches, shared memory, and registers. The roofline model explains performance by comparing arithmetic intensity with hardware limits: low-intensity kernels are memory-bound, while high-intensity kernels are compute-bound.
- Arithmetic IntensityArithmetic intensity is the number of floating-point operations performed per byte of data moved from memory. In the roofline model it determines whether a kernel is memory-bound or compute-bound, which is why matmuls are efficient and elementwise ops often are not.
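The contrast between matmul and elementwise ops is a back-of-the-envelope calculation; this sketch assumes fp32 and the idealised case where each tensor crosses the memory bus exactly once:

```python
# Arithmetic intensity = FLOPs / bytes moved.

def matmul_intensity(n, bytes_per_elem=4):
    flops = 2 * n ** 3                         # n^2 dot products of length n
    bytes_moved = 3 * n * n * bytes_per_elem   # read A and B, write C, each once
    return flops / bytes_moved                 # grows like n/6: compute-bound for large n

def elementwise_add_intensity(n, bytes_per_elem=4):
    flops = n                                  # one add per element
    bytes_moved = 3 * n * bytes_per_elem       # read two inputs, write one output
    return flops / bytes_moved                 # constant 1/12: always memory-bound
```

The key qualitative point: matmul intensity rises with problem size, while elementwise intensity is a constant no amount of scaling can improve.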
- Graph Convolutional Network (GCN)Kipf & Welling's GCN (2017) applies a first-order spectral convolution on graphs: each node's representation is updated as a normalised sum of its neighbours' features, \( H^{(\ell+1)} = \sigma(\tilde D^{-1/2} \tilde A \tilde D^{-1/2} H^{(\ell)} W^{(\ell)}) \). The symmetric normalisation comes from a spectral argument; practically, it is the simplest and most widely-taught GNN layer, setting the template for all message-passing architectures.
- Graph Attention Network (GAT)GAT (Veličković et al., 2018) replaces GCN's fixed degree-normalised aggregation with attention: each node learns per-edge weights via a shared attention mechanism. This gives inductive generalisation (no dependence on the full graph's degree matrix), handles heterogeneous neighbourhoods, and approaches Transformer-style flexibility — though at higher computational cost than GCN.
- Metropolis–Hastings AlgorithmThe canonical MCMC recipe: propose \( \theta' \sim q(\theta' \mid \theta) \), accept with probability \( \min(1, \pi(\theta')q(\theta\mid\theta')/[\pi(\theta)q(\theta'\mid\theta)]) \). Produces a Markov chain with stationary distribution \( \pi \) for any valid proposal, turning intractable posterior sampling into a correctness-guaranteed iterative procedure.
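A random-walk MH sketch: with a symmetric Gaussian proposal the \( q \) terms cancel and the acceptance probability reduces to \( \min(1, \pi(\theta')/\pi(\theta)) \). The target here is a standard normal known only up to a constant (all tuning constants are illustrative):

```python
import math
import random

def metropolis_hastings(log_target, x0, steps, scale, rng):
    x, chain = x0, []
    for _ in range(steps):
        x_new = x + rng.gauss(0.0, scale)               # symmetric proposal
        log_ratio = log_target(x_new) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_new                                   # accept
        chain.append(x)                                 # on reject, repeat the state
    return chain

log_target = lambda x: -0.5 * x * x   # standard normal, unnormalised
chain = metropolis_hastings(log_target, x0=3.0, steps=50_000, scale=1.0,
                            rng=random.Random(0))
```

Despite starting far from the mode at \( x_0 = 3 \), the chain's post-burn-in samples recover the target's mean and variance.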
- Gibbs SamplingA special case of Metropolis–Hastings: cycle through variables, sampling each from its full conditional \( p(\theta_j \mid \theta_{-j}, D) \). When the conditionals are tractable (conjugate priors, mixture models, LDA), Gibbs is simple, always accepts, and has been the workhorse of Bayesian inference for four decades.
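A minimal Gibbs sketch on a bivariate standard normal with correlation \( \rho \), where both full conditionals are known in closed form (\( x \mid y \sim \mathcal{N}(\rho y, 1-\rho^2) \) and symmetrically); the \( \rho \) and chain length are illustrative:

```python
import random

rho = 0.8
rng = random.Random(0)
x, y = 0.0, 0.0
xs, ys = [], []
for t in range(50_000):
    # Alternate exact draws from each full conditional; every draw is "accepted".
    x = rng.gauss(rho * y, (1 - rho ** 2) ** 0.5)
    y = rng.gauss(rho * x, (1 - rho ** 2) ** 0.5)
    if t >= 1000:                 # discard burn-in
        xs.append(x)
        ys.append(y)
```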
- XGBoost, LightGBM, and CatBoostThree production gradient-boosted-decision-tree implementations with distinct tree-construction strategies: XGBoost does level-wise exact/approximate splits with second-order Taylor objective; LightGBM uses histogram-based leaf-wise growth and GOSS subsampling; CatBoost uses ordered boosting to avoid target leakage with categorical features.
- Spectral Clustering & the Graph LaplacianConstruct a similarity graph over data points, embed each point into \( \mathbb{R}^k \) using the eigenvectors belonging to the \( k \) smallest eigenvalues of the graph Laplacian, and run k-means in that space. The Laplacian's spectrum encodes cluster structure as low-frequency eigenmodes — works for non-convex clusters where Euclidean k-means fails.
- Non-Negative Matrix Factorization (NMF)Factorise a nonnegative matrix \( V \approx W H \) with \( W, H \ge 0 \) entrywise. The nonnegativity constraint yields parts-based, interpretable components (topic–word, basis–image) and distinguishes NMF from PCA, whose sign-free components are typically holistic and hard to name.
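A sketch of the Lee–Seung multiplicative updates on an exactly rank-2 nonnegative matrix (the "parts", weights, and iteration count are invented for the example):

```python
import numpy as np

def nmf(V, k, iters=2000, seed=0):
    # Multiplicative updates for min ||V - WH||_F^2 with W, H >= 0 entrywise.
    # Each update multiplies by a nonnegative ratio, so nonnegativity is preserved.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.uniform(0.1, 1.0, (m, k))
    H = rng.uniform(0.1, 1.0, (k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-10)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H

# A rank-2 nonnegative matrix built from two nonnegative "parts".
parts = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 1.0]])
weights = np.array([[1.0, 0.0],
                    [0.0, 2.0],
                    [1.0, 1.0]])
V = weights @ parts
W, H = nmf(V, k=2)
```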
- Independent Component Analysis (ICA)ICA separates a linear mixture \( x = A s \) into statistically independent non-Gaussian sources by finding a de-mixing matrix \( W \) that maximises the non-Gaussianity of \( y = W x \). Classical application: the cocktail-party problem. Key distinction from PCA: maximises independence, not variance.
- Kernel PCAPrincipal component analysis in the implicit feature space of a positive-definite kernel \( k(x, y) \). Eigendecomposes the centred Gram matrix \( K \) rather than the data covariance; recovers non-linear principal directions without ever instantiating the feature map. Used for non-linear dimensionality reduction and feature extraction.
- Isomap & Locally Linear Embedding (LLE)Two classical manifold-learning algorithms: Isomap replaces Euclidean distances with geodesic distances on a \( k \)-NN graph and applies MDS; LLE reconstructs each point from its neighbours' linear weights and finds a low-dimensional embedding that preserves those weights. Both set the conceptual stage for t-SNE and UMAP.
- Conditional Random Fields (CRF)A discriminative model of \( p(y \mid x) = \tfrac{1}{Z(x)} \exp \sum_k \lambda_k f_k(y, x) \) over structured outputs. Linear-chain CRFs add sequence-level constraints on top of per-token scores, enabling tractable training via forward–backward and Viterbi decoding — still the backbone of NER/tagging heads above neural encoders.
- Bayesian Networks & Directed Graphical ModelsA Bayesian network is a DAG \( G \) over variables whose joint factorises as \( p(x) = \prod_i p(x_i \mid \text{pa}_G(x_i)) \). D-separation reads conditional independences off the graph; parameter learning uses MLE under complete data, EM under latent variables. The mathematical foundation of structured probabilistic modelling.
- Variational Autoencoder (VAE)A latent-variable generative model trained by maximising the ELBO \( \mathcal{L}(x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\text{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \). The reparameterisation trick makes the encoder \( q_\phi \) differentiable; the decoder \( p_\theta \) learns to reconstruct \( x \) from latent codes \( z \sim \mathcal{N}(0, I) \).
- Restricted Boltzmann Machines (RBM)A bipartite EBM over visible and hidden binary units with energy \( E(v, h) = -v^\top W h - b^\top v - c^\top h \). Conditional independence within each layer gives closed-form conditionals \( p(h\mid v), p(v\mid h) \); Hinton's Contrastive Divergence trains them and the RBM stack forms a deep belief net.
- Noise-Contrastive Estimation (NCE)Learn an unnormalised model \( \tilde p_\theta(x) \) by training a binary classifier to distinguish data samples from noise samples. The classifier's logit becomes \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \), so the partition function is absorbed into a learnable constant. Foundation of word2vec's negative sampling and of InfoNCE contrastive learning.
- U-Net ArchitectureA fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- Semantic & Instance SegmentationSemantic segmentation assigns a class label to every pixel; instance segmentation further distinguishes object instances (two cats become two masks). Panoptic segmentation unifies them: one label per pixel with 'thing' vs 'stuff' classes. Backbones: FCN, DeepLab, Mask R-CNN, DETR/Mask2Former.
- Object Detection: R-CNN → Faster R-CNN → DETRR-CNN ran a CNN classifier on externally-proposed regions; Fast R-CNN shared backbone features across proposals; Faster R-CNN introduced a learned Region Proposal Network; DETR replaced the entire region-proposal pipeline with a transformer that predicts a fixed set of boxes via bipartite matching.
- YOLO Family (v1 – v10)Single-stage detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from a single CNN forward pass. YOLOv1 was real-time but coarse; YOLOv2 introduced anchor boxes, and v3–v10 progressively adopted FPN, CSP blocks, decoupled heads, and finally anchor-free / NMS-free designs for edge deployment.
- PointNet & 3D Deep Learning on Point CloudsPointNet processes an unordered point set by applying a shared MLP to each point, then pooling across points with a symmetric function (max-pool). Permutation-invariant by construction; PointNet++ adds local-region hierarchies to capture geometric structure.
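The permutation invariance claim is easy to verify directly: a shared per-point map followed by max-pooling cannot see point order. A one-layer stand-in for the shared MLP (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))   # shared per-point layer (linear + ReLU here)

def pointnet_feature(points):
    h = np.maximum(points @ W, 0.0)   # the same map applied to every point
    return h.max(axis=0)              # symmetric pooling: order cannot matter

cloud = rng.standard_normal((100, 3))
shuffled = cloud[rng.permutation(100)]
feat_a = pointnet_feature(cloud)
feat_b = pointnet_feature(shuffled)
```

Any symmetric reduction (max, sum, mean) gives the invariance; PointNet found max-pooling worked best in practice.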
- Pointer NetworksA seq2seq architecture whose decoder outputs indices into the input sequence via attention weights (a 'pointer') rather than tokens from a fixed vocabulary. Ideal for combinatorial problems whose output vocabulary depends on the input (convex hull, TSP, sorting) and for extractive QA / span prediction.
- Siamese Networks & Metric LearningTrain a shared encoder so that semantically similar inputs map to nearby embeddings (contrastive / triplet loss) or that query-key scores reflect similarity directly. Used in face verification, signature matching, image retrieval; the conceptual parent of SimCLR, CLIP, and BiEncoder retrieval.
- Highway NetworksPredecessor to ResNet: a gated skip connection \( y = H(x) \cdot T(x) + x \cdot (1 - T(x)) \), where \( T(x) \in [0,1] \) is a learned transform gate. Enabled training of 100+ layer networks before residual connections simplified the construction in ResNet.
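The gate equation can be sanity-checked in a few lines; with the gate fully closed the layer is an identity map, which is why very deep stacks remain trainable:

```python
import numpy as np

def highway(x, H, T):
    # y = H(x) * T(x) + x * (1 - T(x)); T(x) in [0, 1] is the transform gate
    t = T(x)
    return H(x) * t + x * (1.0 - t)

x = np.array([1.0, -2.0, 3.0])
# gate fully closed (T = 0): input passes through untouched
y = highway(x, H=np.tanh, T=lambda v: np.zeros_like(v))
```

In the real network `H` is an affine layer plus nonlinearity and `T` a sigmoid-gated affine layer; the lambdas here just pin the gate to its extremes.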
- Capsule NetworksHinton's alternative to CNN pooling: neurons are grouped into 'capsules' whose vector output encodes both existence and pose of an entity. Dynamic routing by agreement replaces max-pool, so each capsule decides which higher-level capsule to vote for based on agreement. Historically significant; practically superseded by transformers.
- Lion OptimizerA momentum-only optimizer discovered by evolutionary program search: updates are \( \theta \leftarrow \theta - \eta \, \text{sign}(\beta_1 m + (1-\beta_1) g) \). Uses only the sign of a momentum estimate — no second moment, half the state of AdamW, often matches or beats it on large-scale training with smaller learning rates.
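A minimal sketch of one Lion step, following the update rule above plus the paper's decoupled weight decay. Note the asymmetry: the sign uses a \( \beta_1 \)-interpolation while the stored momentum is tracked with \( \beta_2 \):

```python
import numpy as np

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # update direction: sign of an interpolation between momentum and gradient
    update = np.sign(beta1 * m + (1.0 - beta1) * g)
    theta = theta - lr * (update + wd * theta)   # decoupled weight decay
    # momentum is tracked with a *different* coefficient than the update used
    m = beta2 * m + (1.0 - beta2) * g
    return theta, m

theta, m = lion_step(np.zeros(3), np.zeros(3), np.array([1.0, -2.0, 0.0]))
```

Because every coordinate moves by exactly \( \pm\eta \) (or 0), effective step sizes are uniform across parameters, which is why Lion typically wants a smaller learning rate than AdamW.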
- AdaFactor OptimizerA memory-efficient Adam variant. Replaces the full second-moment matrix with row and column running averages (a rank-1 factorisation): memory is \( O(m + n) \) instead of \( O(mn) \). Used by T5, Meta's BlenderBot, and many large-scale models to free memory for bigger batches and longer contexts.
- Focal Loss & Class-Imbalance ObjectivesFocal loss \( \text{FL}(p_t) = -(1-p_t)^\gamma \log p_t \) down-weights the loss contribution of confidently-classified easy examples, focusing gradient on hard ones. Designed for extreme foreground-background imbalance in one-stage detection; widely used in segmentation and long-tailed classification.
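The down-weighting behaviour is visible directly from the formula; a minimal numpy version:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 0 recovers cross-entropy
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(np.array(0.95))   # confidently correct: loss nearly zeroed
hard = focal_loss(np.array(0.1))    # hard example keeps a large loss
```

With \( \gamma = 2 \) the easy example's loss is scaled by \( (1 - 0.95)^2 = 0.0025 \), so millions of easy background anchors no longer drown out the rare hard foreground ones.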
- InfoNCE & NT-Xent Contrastive LossesInfoNCE maximises a mutual-information lower bound by classifying a positive pair against \( k \) negatives: \( \mathcal{L} = -\log \frac{\exp(s^+)}{\sum_i \exp(s_i)} \). NT-Xent is InfoNCE with temperature-scaled cosine similarities. Drives SimCLR, MoCo, CLIP, and most modern self-supervised representation learning.
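In code the loss is just cross-entropy with the positive at a known index. A sketch with made-up similarity scores and the temperature scaling used by NT-Xent:

```python
import numpy as np

def info_nce(sim_pos, sim_negs, temperature=0.1):
    # -log softmax over [positive, negatives], evaluated at the positive slot
    logits = np.concatenate(([sim_pos], sim_negs)) / temperature
    logits -= logits.max()   # subtract max for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# toy cosine similarities: the positive pair is well separated from negatives
loss = info_nce(sim_pos=0.9, sim_negs=np.array([0.1, 0.0, -0.2]))
```

Lower temperature sharpens the softmax, so the loss concentrates on the hardest negatives; this is the main knob SimCLR-style methods tune.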
- Expected Calibration Error (ECE) & Reliability DiagramsMeasures miscalibration: \( \text{ECE} = \sum_b (|B_b|/n) \, |\text{acc}(B_b) - \text{conf}(B_b)| \). Predictions are bucketed by confidence; the gap between mean confidence and accuracy in each bucket sums to ECE. Deep nets are typically over-confident; temperature scaling restores calibration.
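The binned definition translates directly into code. A minimal sketch with equal-width confidence buckets and a toy batch where 90%-confident predictions are only 80% accurate:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap   # |B_b| / n weighting
    return total

conf = np.array([0.9, 0.9, 0.9, 0.9, 0.9])
acc = np.array([1, 1, 1, 1, 0])          # 80% accurate at 90% confidence
print(round(ece(conf, acc), 3))          # -> 0.1
```

The 0.1 gap between mean confidence and accuracy in the single occupied bucket is exactly what temperature scaling would shrink.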
- Platt Scaling & Isotonic RegressionPlatt scaling calibrates a classifier by fitting a one-dimensional logistic regression from raw scores to probabilities on a validation set. Isotonic regression is a more flexible monotonic alternative that can fit non-sigmoid calibration curves, but it usually needs more calibration data to avoid overfitting.
- Bayesian Deep Learning: MC Dropout & Deep EnsemblesMC dropout estimates predictive uncertainty by keeping dropout active at test time and averaging many stochastic forward passes. Deep ensembles train several independently initialized models and usually give stronger uncertainty estimates, at higher training and serving cost.
- Paired Bootstrap & McNemar's Test for Model ComparisonTwo non-parametric procedures for deciding whether model A beats model B with statistical significance. Paired bootstrap resamples matched predictions to estimate a confidence interval on the metric difference. McNemar's test uses a chi-squared on the \( 2\times 2 \) contingency table of agreement / disagreement.
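The McNemar statistic needs only the two discordant cells of the contingency table. A sketch without continuity correction, with invented counts:

```python
def mcnemar_statistic(b, c):
    """Chi-squared statistic from the discordant cells of the 2x2 table.

    b: examples only model A got right
    c: examples only model B got right
    """
    if b + c == 0:
        return 0.0   # models never disagree: no evidence either way
    return (b - c) ** 2 / (b + c)

# A fixes 30 of B's errors while B fixes only 12 of A's: is the gap real?
stat = mcnemar_statistic(30, 12)   # compare with the chi-squared(1) cutoff 3.84
```

Examples both models get right (or both get wrong) cancel out, which is why McNemar is far more sensitive than comparing raw accuracies on overlapping test sets.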
- Data Curation & Quality Filters (FineWeb, Dolma)Modern pretraining pipelines filter terabytes of web data through language ID, heuristic rules (repetition, punctuation ratios), classifier-based quality scoring, and toxicity / PII removal. The FineWeb and Dolma recipes document which filters mattered — often delivering per-token quality gains equivalent to 2–3× scale-up.
- MinHash & LSH for Large-Scale DeduplicationMinHash approximates Jaccard similarity between sets, and locality-sensitive hashing uses that approximation to quickly find near-duplicates. In ML data pipelines this is used to deduplicate documents or examples at very large scale.
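The core estimator fits in a few lines: one seeded hash plays the role of each random permutation, and the fraction of matching signature slots estimates Jaccard similarity. A sketch using `zlib.crc32` as the cheap seeded hash (production systems use more and better hash functions plus LSH banding on top):

```python
import zlib

def minhash_signature(tokens, n_perm=64):
    # min of a differently-seeded hash per slot stands in for a permutation
    return [min(zlib.crc32(t.encode(), seed) for t in tokens)
            for seed in range(n_perm)]

def estimated_jaccard(sig_a, sig_b):
    # P(min-hash collision per slot) equals the true Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"the", "cat", "sat", "on", "mat"})
b = minhash_signature({"the", "cat", "sat", "on", "a", "mat"})
est = estimated_jaccard(a, b)   # true Jaccard here is 5/6
```

Identical documents always collide in every slot, so exact duplicates are found for free; the LSH layer then buckets signatures so near-duplicates are compared without an all-pairs scan.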
- Synthetic Data Generation for Post-TrainingModern instruction-tuning and RL-based alignment rely on LLM-generated synthetic data: self-instruct / Evol-Instruct expand seed prompts, teacher models produce high-quality completions, and process-reward models validate chain-of-thought steps. The backbone of the post-ChatGPT post-training stack.
- Continual Pretraining & Mid-TrainingContinue pretraining an existing base model on a domain or task-focused corpus (code, math, a new language) before final post-training. Achieves domain gains that would cost 10× more to obtain by fine-tuning alone. Sits between pretraining and SFT in modern recipes.
- Tokenizer Training Dynamics & Vocab SizingTokenizer design trades vocabulary size against sequence length and changes both compute cost and what patterns the model can represent cleanly. BPE, unigram, and byte-level schemes make different compromises, especially for code, multilingual text, and rare domain terms.
- Refusal Training & Harmlessness ObjectivesTeach a model to decline unsafe or out-of-policy queries while remaining helpful on benign ones. Typical recipe: SFT on curated refusal examples, preference optimisation with harm-labelled pairs, red-team-driven iteration. Badly done, this causes over-refusal (the 'brittle helpfulness' regime); done well, it achieves high refusal rate with minimal utility loss.
- Plan-and-Solve PromptingTwo-stage prompt pattern: first ask the model to produce a plan (step-by-step outline), then execute the plan step by step. Improves over naive zero-shot CoT on arithmetic and multi-hop reasoning by forcing explicit decomposition. Contrasts with ReAct's interleaved thought-action loop.
- Constrained Decoding: Grammars, JSON Mode, RegexGrammar-constrained decoding masks any next token that would violate a target grammar or schema. This guarantees outputs such as valid JSON, XML, or regex-matching strings by restricting generation to the language accepted by a finite-state machine or pushdown automaton.
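A heavily simplified sketch of the masking idea: a character-level vocabulary, random stand-in logits, and a constraint function for the regex `[0-9]+` (whose allowed set happens to be constant, so no real FSM is needed here). Real systems compile the grammar into an automaton over tokenizer tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("0123456789abc{}:,")   # toy character vocabulary

def allowed(prefix):
    # toy constraint for [0-9]+ : every next character must be a digit
    return set("0123456789")

def constrained_greedy_decode(steps=5):
    out = ""
    for _ in range(steps):
        logits = rng.standard_normal(len(vocab))   # stand-in for model logits
        mask = np.array([tok in allowed(out) for tok in vocab])
        logits[~mask] = -np.inf                    # forbidden tokens never win
        out += vocab[int(logits.argmax())]
    return out

s = constrained_greedy_decode()
```

Because masking happens before the argmax (or before sampling), the output is valid by construction rather than by post-hoc validation and retry.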
- Model Context Protocol (MCP)Model Context Protocol is an open client-server protocol for connecting models and agents to external tools, resources, and prompts through a common interface. It standardizes capability discovery and tool invocation so one MCP-aware client can talk to many different servers without bespoke integrations.
- Continuous vs Static BatchingStatic batching groups requests before a forward pass and runs them to completion together — tail latency is set by the slowest request. Continuous batching (Orca, vLLM) evicts finished requests mid-step and admits new ones each iteration, keeping GPU utilisation high and tail latency bounded. Default in production LLM serving.
- Federated Learning (FedAvg)Train a shared model across many clients (phones, hospitals) without centralising data. Each round: clients train locally for a few epochs, server averages their weight updates. Introduces non-IID-data, communication-cost, and privacy / security challenges absent in centralised training.
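The server-side aggregation in FedAvg is a dataset-size-weighted average of client models. A sketch with two toy clients:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    # aggregate client models, weighted by local dataset size
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

w1 = np.array([1.0, 0.0])   # client with 100 local examples
w2 = np.array([0.0, 1.0])   # client with 300 local examples
global_w = fedavg([w1, w2], [100, 300])
```

The weighting makes the update equivalent to one SGD pass over the pooled data when clients are IID; under non-IID data this equivalence breaks, which is the source of FedAvg's drift problems.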
- Membership Inference Attacks (MIA)Determine whether a specific example was in the training set of a deployed model. Attacks exploit loss / confidence gaps between seen and unseen examples — trained models are typically more confident on memorised training points. A baseline for privacy leakage in ML systems.
- Embedding GeometryEmbedding geometry studies what information is encoded in the distances, angles, and directions of an embedding space. Properties like similarity, analogy structure, anisotropy, and clustering determine how useful an embedding is for retrieval and downstream tasks.
- Tokenization as RepresentationTokenization is not just preprocessing: it decides which units the model can represent directly and therefore shapes the statistics the model learns. The choice of characters, subwords, bytes, or domain-specific tokens changes sequence length, vocabulary size, inductive bias, and how cleanly concepts map into embeddings.
- Generative Model Evaluation (FID, IS, and their limits)Fréchet Inception Distance (FID) and Inception Score (IS) are the standard automated metrics for image generative models; both rely on Inception-v3 features and have well-known biases. Modern T2I evaluation supplements them with CLIPScore, prompt-adherence benchmarks (T2I-CompBench, GenEval), human-preference Elo (ImageReward, HPS), and likelihood / NLL where applicable.
- Memory in LLMsAn LLM's working memory is the KV-cache for the current context window; longer-term memory is implemented externally by retrieval over vector / hybrid stores, by writing to scratchpads or tool state, or by parameter updates (continual fine-tuning). Each option has a different latency, capacity, and forgetting profile; production systems combine all three.
- Exploration vs Exploitation (Deep RL View)Exploration versus exploitation is the trade-off between taking actions that seem best under current knowledge and taking actions that improve knowledge of the environment. In deep RL the problem is harder than in bandits because rewards can be sparse, state spaces are large, and short-term randomness is often not enough to discover long-horizon strategies.
- Efficient Inference (Distillation, Pruning)Efficient inference reduces latency, memory, or serving cost without retraining a model from scratch, often by distilling a smaller student or pruning unimportant weights and structures. Distillation transfers behavior; pruning removes computation, and the two are often combined with quantization for deployment-grade compression.
- Data-Centric AIData-centric AI treats data quality, labeling, coverage, and feedback loops as first-class levers of model performance rather than focusing only on architecture or hyperparameters. The core workflow is to diagnose failure slices, improve the dataset that generates those failures, and measure whether the model gets better for the right reasons.
- Dataset Versioning & LineageDataset versioning and lineage track exactly which raw data, labels, transformations, and filters produced a training or evaluation set. They matter because reproducibility, compliance, rollback, and debugging all depend on being able to answer "which data built this model?" with more precision than a folder name or timestamp.
- Feature StoresFeature stores are systems for defining, computing, and serving reusable machine-learning features consistently across training and production. Their core promise is point-in-time correctness and train/serve consistency: the feature a model saw offline should match the feature served online for the same entity and timestamp.
- Distribution Shift & Dataset ShiftDistribution shift occurs when the joint distribution seen at deployment differs from the one used for training or validation. The main cases are covariate shift, label shift, and concept shift; each breaks generalization in a different way and therefore requires different detection and mitigation strategies.
- Sufficient StatisticsA sufficient statistic is a summary of the sample that retains all information about a parameter relevant for inference. This is why many classical models can replace an entire dataset with counts, sums, or means without changing the likelihood-based conclusions about the parameter.
- Bayes Risk and the Bayes Optimal ClassifierBayes risk is the minimum achievable expected loss under the true data distribution, and the Bayes-optimal classifier attains it by minimizing posterior expected loss for each input. Under ordinary 0–1 loss, that rule becomes “predict the class with highest posterior probability.”
- Proper Scoring RulesA scoring rule is proper if a forecaster minimizes expected score by reporting their true predictive distribution. Proper scoring rules matter because they reward honest, calibrated probabilities rather than merely getting the top-ranked class right.
- Surrogate Losses and Classification CalibrationSurrogate losses replace hard-to-optimize 0–1 classification loss with tractable objectives such as logistic or hinge loss. A surrogate is classification-calibrated if optimizing it still drives the classifier toward the Bayes-optimal decision rule.
- Structural Risk Minimization (SRM)Structural risk minimization extends empirical risk minimization by balancing training fit against model complexity. It is the learning-theoretic principle behind regularization, margin control, and choosing among hypothesis classes of different capacity.
- Precision-Recall Curve and Average PrecisionA precision-recall curve shows how precision and recall trade off as the decision threshold moves through a ranked list of predictions. Average precision summarizes that curve and is especially informative when the positive class is rare.
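One common definition of average precision is the mean of precision values at the ranks where relevant items appear. A sketch over a score-sorted 0/1 relevance list:

```python
import numpy as np

def average_precision(labels_ranked):
    """AP over 0/1 relevance labels sorted by model score, descending."""
    labels = np.asarray(labels_ranked, dtype=float)
    cum_hits = np.cumsum(labels)
    ranks = np.arange(1, len(labels) + 1)
    precision_at_hits = (cum_hits / ranks)[labels == 1]
    return precision_at_hits.mean() if labels.sum() else 0.0

# hits at ranks 1 and 3: precision 1/1 and 2/3, so AP = 5/6
ap = average_precision([1, 0, 1])
```

Because every term is a precision at a rank containing a positive, AP rewards pushing positives to the top and is unaffected by the flood of easy negatives at the bottom of the ranking.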
- Hinge LossHinge loss penalizes examples that are misclassified or that lie too close to the decision boundary. It is the convex margin-based loss underlying soft-margin SVMs and emphasizes confident separation rather than calibrated probabilities.
- Cost-Sensitive LearningCost-sensitive learning assigns different penalties to different kinds of mistakes instead of treating every error equally. It is the right framework when the real objective is to minimize downstream harm or utility loss rather than raw misclassification rate.
- Class Imbalance and ReweightingClass imbalance means some labels are much rarer than others, so an unweighted objective can be dominated by the majority class. Reweighting changes the loss or sampling scheme so rare classes exert more influence during training.
- Dynamic Programming for RLDynamic programming solves an MDP with a known model by repeatedly applying Bellman updates until values or policies become self-consistent. Policy evaluation, policy improvement, policy iteration, and value iteration are the core algorithms in that family.
- Potential Outcomes FrameworkThe potential outcomes framework defines causal effects by comparing the outcomes a unit would have under different treatments. Because only one of those potential outcomes is observed for any given unit, causal inference is fundamentally about identifying missing counterfactuals under defensible assumptions.
- Confounding, Colliders, and Simpson’s ParadoxConfounders create misleading associations because they affect both treatment and outcome, while colliders create bias when you condition on them. Simpson’s paradox is the visible symptom that aggregate and stratified associations can reverse direction when the underlying causal structure is ignored.
- Attention Is All You Need“Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.
- Multiple Hypothesis Testing and False Discovery RateMultiple hypothesis testing asks how to control false positives when many tests are run at once. False discovery rate control, especially the Benjamini–Hochberg procedure, limits the expected fraction of rejected hypotheses that are actually null and is usually less conservative than family-wise error control.
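The Benjamini–Hochberg step-up procedure is short enough to state in full: sort the p-values, find the largest \( k \) with \( p_{(k)} \le \tfrac{k}{m} q \), and reject everything up to that rank:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # k/m * q for each sorted rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()          # largest k with p_(k) <= kq/m
        reject[order[: k + 1]] = True           # step-up: reject all ranks <= k
    return reject

rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], q=0.05)
```

Note the step-up behaviour: a sorted p-value above its own threshold can still be rejected if some later rank clears its threshold, which is what makes BH less conservative than Bonferroni.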
- Likelihood Ratio TestsA likelihood ratio test compares how well two nested statistical models explain the same data by taking the ratio of their maximized likelihoods. Large likelihood-ratio statistics indicate that the larger model fits substantially better than the restricted one, and under regularity conditions the test statistic is asymptotically chi-squared.
- Importance SamplingImportance sampling estimates an expectation under a target distribution by drawing samples from a different proposal distribution and reweighting them. It is powerful when the proposal places more mass in the important regions of the integrand, but unstable weights can make the variance explode.
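A minimal discrete sketch: estimate \( \mathbb{E}_p[X] \) for a uniform target while sampling from a deliberately skewed proposal, correcting with weights \( p(x)/q(x) \). The proposal probabilities are arbitrary toy numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# target p: uniform over {0,...,9}; proposal q: skewed toward small values
support = np.arange(10)
p = np.full(10, 0.1)
q = np.array([0.30, 0.20, 0.15, 0.10, 0.07, 0.06, 0.05, 0.04, 0.02, 0.01])

x = rng.choice(support, size=200_000, p=q)
weights = p[x] / q[x]              # importance weights p(x)/q(x)
estimate = np.mean(weights * x)    # unbiased estimate of E_p[X] = 4.5
```

The weights for rarely proposed values (here \( x = 9 \), weight 10) are exactly where the variance lives: a proposal that under-covers important regions produces a few enormous weights that dominate the estimate.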
- Bootstrap Confidence IntervalsBootstrap confidence intervals estimate uncertainty by resampling the observed dataset with replacement and recomputing the statistic many times. They are useful when analytic standard errors are awkward, but they inherit the sample's biases and can fail when the original sample is too small or unrepresentative.
- One-Class SVMA one-class SVM learns a boundary around mostly normal data by separating mapped training points from the origin with maximum margin in feature space. It is a classic novelty-detection method because it needs examples of the inlier class but not labeled anomalies.
- Latent Dirichlet Allocation (LDA Topic Models)Latent Dirichlet Allocation models each document as a mixture of latent topics and each topic as a distribution over words. It is a generative model for uncovering coarse semantic structure in bag-of-words corpora, not a modern contextual language model.
- Factor AnalysisFactor analysis models observed variables as linear combinations of a small number of latent factors plus variable-specific noise. It is useful when the goal is to explain covariance structure rather than merely reduce dimension, which is the key difference from PCA.
- Kalman FilterThe Kalman filter recursively estimates the hidden state of a linear Gaussian dynamical system by alternating a prediction step with a measurement update. It is optimal for that model class because the posterior remains Gaussian and is fully described by a mean and covariance.
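The scalar case shows the whole predict/update cycle in a few lines. A sketch tracking a nearly constant hidden value from noisy measurements (noise variances are illustrative choices):

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-4, meas_var=0.25):
    """Track a (nearly) constant scalar observed with Gaussian noise."""
    mean, var = 0.0, 1.0              # Gaussian prior over the hidden state
    for z in measurements:
        var += process_var            # predict: dynamics noise grows uncertainty
        k = var / (var + meas_var)    # Kalman gain: trust in the new measurement
        mean += k * (z - mean)        # measurement update of the mean
        var *= (1.0 - k)              # measurement update of the variance
    return mean, var

rng = np.random.default_rng(0)
true_value = 3.0
zs = true_value + 0.5 * rng.standard_normal(500)
mean, var = kalman_1d(zs)
```

Because both steps map Gaussians to Gaussians, the pair `(mean, var)` is the entire posterior; no particle cloud or sample history is needed.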
- Missing Data and ImputationMissing-data methods try to preserve inference when some values are unobserved by modeling why data are missing and how to fill or integrate over the missing entries. The key distinction is between MCAR, MAR, and MNAR, because imputation is far safer when missingness can be treated as conditionally ignorable.
- Ablation Studies and Experimental ControlAn ablation study removes or alters one component of a system to measure how much that component actually contributes. Experimental control matters because an ablation is only informative when the comparison keeps everything else fixed, including data, tuning budget, and evaluation protocol.
- Value IterationValue iteration solves a known Markov decision process by repeatedly applying the Bellman optimality backup until the value function converges. Once the optimal value is approximated, a greedy policy with respect to that value is optimal or near-optimal.
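A minimal worked example on a deterministic 3-state chain (toy rewards and dynamics): state 0 leads to state 1, state 1 leads to the terminal state 2 with reward 1, and each state also has a do-nothing 'stay' action:

```python
import numpy as np

n_states, gamma = 3, 0.9
next_state = {0: 1, 1: 2, 2: 2}       # deterministic 'advance' transitions
reward = {0: 0.0, 1: 1.0, 2: 0.0}     # reward for 'advance' from each state

V = np.zeros(n_states)
for _ in range(100):
    newV = V.copy()
    for s in (0, 1):                  # state 2 is terminal
        advance = reward[s] + gamma * V[next_state[s]]
        stay = 0.0 + gamma * V[s]
        newV[s] = max(advance, stay)  # Bellman optimality backup
    V = newV

# optimal values: V(s1) = 1, and V(s0) = gamma * 1 = 0.9
```

The discount structure is visible in the result: the reward one step further away is worth exactly one factor of \( \gamma \) less, and the greedy policy with respect to `V` always chooses 'advance'.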
- Policy IterationPolicy iteration alternates between evaluating the current policy and improving it by acting greedily with respect to that value function. It often converges in fewer outer loops than value iteration because each improvement step uses a more fully solved subproblem.
- Monte Carlo Reinforcement LearningMonte Carlo reinforcement learning estimates values from complete sampled returns rather than from one-step bootstrapped targets. That makes the targets unbiased with respect to the episode return, but usually higher variance than temporal-difference methods.
- Credit Assignment ProblemThe credit assignment problem is the problem of determining which earlier actions, states, or internal computations deserve blame or credit for a later outcome. It is hard because rewards and losses are often delayed, sparse, or distributed across many interacting decisions.
- CounterfactualsA counterfactual asks what would have happened under a different action or treatment than the one that actually occurred. The central difficulty is that for any individual unit, only one potential outcome is observed, so causal inference always requires assumptions to recover the missing alternative.
- Bahdanau AttentionBahdanau attention is the original additive attention mechanism for sequence-to-sequence models, where the decoder scores each encoder state before producing the next token. It solved the fixed-context bottleneck of early seq2seq RNNs by letting the decoder look back over the whole source sequence at every step.
- Seq2Seq with AttentionSeq2seq with attention augments the encoder-decoder architecture so the decoder conditions on a context vector built from all encoder states at each output step. That change made neural machine translation far more effective than fixed-context seq2seq and directly paved the way to modern cross-attention and Transformer models.