Tag: implementation
188 topic(s)
- Layer Dropping and Progressive Pruning (TrimLLM)Layer dropping and progressive pruning reduce inference cost by cutting transformer depth rather than shrinking every matrix. TrimLLM does this progressively for domain-specialized LLMs, exploiting the empirical fact that not all layers are equally important in a target domain and aiming to retain in-domain accuracy while reducing latency.
- Critical Representation Fine-Tuning (CRFT)Critical Representation Fine-Tuning (CRFT) is a PEFT method that improves reasoning by editing a small set of causally important hidden states instead of updating model weights broadly. It identifies critical representations through information-flow analysis and learns low-rank interventions on those states while keeping the base model frozen.
- ZeRO (Zero Redundancy Optimizer)ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and eventually parameters across data-parallel workers so each GPU no longer stores a full copy of the training state. This cuts memory dramatically and makes very large-model training feasible without requiring full model-parallel architectures.
- Weight TyingWeight tying uses the same matrix for token embeddings and the output softmax projection, typically by setting the output weights to the transpose of the input embedding table. This cuts parameters and often improves language modeling by forcing input and output token representations to share geometry.
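A minimal PyTorch sketch of the tying trick (module and dimension names here are illustrative):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding table.
        # nn.Linear stores its weight as (out, in) = (vocab, d_model), the same
        # shape as the embedding table, so the assignment needs no transpose;
        # the transpose happens implicitly because Linear computes x @ W.T.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        h = self.embed(token_ids)      # (batch, seq, d_model)
        return self.lm_head(h)         # (batch, seq, vocab)

lm = TinyLM()
print(lm.embed.weight.data_ptr() == lm.lm_head.weight.data_ptr())  # True: shared storage
```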
- Gradient Checkpointing (Activation Recomputation)Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- PagedAttentionPagedAttention stores the KV cache in fixed-size non-contiguous blocks, like virtual-memory pages, instead of requiring one contiguous allocation per sequence. This largely removes fragmentation, enables prompt-prefix sharing, and is a key reason vLLM can serve many more concurrent requests.
- Speculative DecodingSpeculative decoding speeds up autoregressive generation by letting a small draft model propose several tokens and then having the large target model verify them in parallel. With the rejection-sampling correction from the original algorithm, the output distribution remains exactly the same as sampling from the target model alone.
- Proximal Policy Optimization (PPO)Proximal Policy Optimization is a policy-gradient algorithm that improves a policy while clipping how far action probabilities can move from the previous policy in one update. In RLHF it is usually paired with a KL penalty so the model gains reward without drifting too far from a reference model.
- AdamW OptimizerAdamW is Adam with decoupled weight decay: parameter shrinkage is applied directly to the weights instead of being mixed into the adaptive gradient update. This preserves the intended regularization effect and is why AdamW became the default optimizer for many Transformer models.
- Byte Pair Encoding (BPE)Byte Pair Encoding is a subword tokenization method that repeatedly merges the most frequent adjacent symbols in a corpus. It builds a vocabulary between characters and whole words, which handles rare words better than word-level tokenization while keeping sequence lengths manageable.
- Softmax TemperatureSoftmax temperature rescales logits before softmax to control randomness in the output distribution. Lower temperature makes probabilities sharper and decoding more deterministic, while higher temperature flattens the distribution and increases diversity.
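A small NumPy sketch of temperature-scaled sampling (logit values are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, seed=None):
    """Divide logits by T, softmax, then sample one token id."""
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(p), p=p)

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.5, seed=0))  # sharper distribution
print(sample_with_temperature(logits, temperature=2.0, seed=0))  # flatter distribution
```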
- Key-Value (KV) CachingKey-value caching stores the attention keys and values from earlier tokens during autoregressive decoding so they do not need to be recomputed at every step. It speeds up generation dramatically, but the cache grows with sequence length and turns inference into a memory-management problem.
- K-Means Objective Function (Inertia)The K-means objective, also called inertia, is the sum of squared distances from each point to its assigned cluster centroid. K-means greedily lowers that objective by alternating between reassigning points and recomputing centroids, though the result still depends on initialization because the problem is nonconvex.
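A NumPy sketch of the objective; the data and centroids below are random placeholders:

```python
import numpy as np

def inertia(X, centroids, labels):
    """K-means objective: sum of squared distances to assigned centroids."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
C = rng.normal(size=(3, 2))
labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)  # nearest centroid
print(inertia(X, C, labels))
```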
- DropoutDropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer.
- Mini-batch gradient descentMini-batch gradient descent estimates the gradient on a small subset of training examples at each update instead of on the full dataset or a single example. It is the practical default in deep learning because it balances hardware efficiency with optimization noise.
- Early stoppingEarly stopping regularizes training by halting optimization when validation performance stops improving and keeping the best checkpoint seen so far. It works because prolonged optimization can eventually fit noise or idiosyncrasies of the training set rather than signal.
- HyperparameterA hyperparameter is a setting chosen outside the optimization loop, such as learning rate, model width, regularization strength, or batch size. Unlike learned parameters, hyperparameters govern how the model is trained or structured and are usually selected by validation.
- Data leakageData leakage occurs when information that would not be available at prediction time leaks into training or model selection, causing overly optimistic evaluation. Common examples are fitting preprocessing on the full dataset, peeking at test labels, or using future information in time-series tasks.
- Learning rateThe learning rate is the scalar that sets how large each optimization step is when parameters are updated. If it is too high training can diverge or oscillate, and if it is too low training can become extremely slow or get stuck.
- EpochAn epoch is one complete pass through the training dataset. In mini-batch training, an epoch consists of many updates, one for each batch needed to cover the data once.
- TokenizationTokenization is the process of splitting raw text into model-readable tokens such as words, subwords, bytes, or characters. It determines vocabulary size, sequence length, and how efficiently a language model handles rare words, multilingual text, and code.
- SubwordA subword is a token unit smaller than a full word but larger than a character, learned to balance vocabulary size against sequence length. Subword tokenization lets models handle rare and novel words by composing them from reusable pieces.
- VocabularyA vocabulary is the fixed set of tokens a tokenizer can map text into and a model can natively process. Its size trades off compression against flexibility: larger vocabularies shorten sequences, while smaller ones rely more on subword or byte composition.
- Chat language modelA chat language model is a pretrained LLM further tuned to follow instructions and handle multi-turn dialogue. It is usually built by supervised fine-tuning plus preference optimization or RLHF, so it behaves more helpfully and safely than the raw base model.
- Attention MaskAn attention mask is a tensor that tells an attention layer which positions may interact and which must be blocked. It is used for causal generation, padding suppression, and task-specific visibility patterns, and it must be applied before softmax, not after.
- Grouped-Query Attention (GQA)Grouped-query attention shares key and value heads across groups of query heads, reducing KV-cache size and bandwidth during inference. It sits between full multi-head attention and multi-query attention, preserving most quality while making long-context serving cheaper.
- Root Mean Square Normalization (RMSNorm)RMSNorm normalizes activations by their root mean square without subtracting the mean. Compared with LayerNorm it is slightly cheaper and often just as effective, which is why many modern LLMs use RMSNorm in place of full mean-and-variance normalization.
- System PromptA system prompt is a high-priority instruction block that defines the assistant's role, rules, and behavioral constraints for a conversation. It is usually prepended invisibly to user messages and is intended to override lower-priority prompt content.
- Prompting FormatPrompting format is the template used to serialize instructions, roles, examples, and conversation turns into the token sequence a model expects. It matters because the same words in a different format can change model behavior, especially for chat-tuned systems.
- Prompt EngineeringPrompt engineering is the practice of designing prompts that make a model reliably produce the desired behavior. It includes choosing instructions, examples, structure, and reasoning scaffolds, and it trades parameter updates for careful interface design.
- Function CallingFunction calling is a language-model capability for producing structured tool invocations instead of only plain text. The model selects a function and arguments that match a schema, which makes tool use more reliable and easier to integrate with software systems.
- Supervised Fine-Tuning (SFT)Supervised fine-tuning trains a pretrained model on curated input-output pairs so it follows instructions, styles, or task formats more reliably. In chat systems, SFT is the stage that turns a raw completion model into an assistant before preference alignment is applied.
- FinetuningFinetuning continues training a pretrained model on a smaller task-specific or domain-specific dataset. It adapts existing representations rather than learning from scratch, which is why it usually needs far less data and compute than pretraining.
- Full Fine-TuneA full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
- Parameter-Efficient Fine-Tuning (PEFT)PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
- Low-Rank Adaptation (LoRA)LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
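A minimal PyTorch sketch of the idea, wrapping an existing `nn.Linear`; the `alpha / rank` scaling follows the common LoRA convention:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update (a sketch)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze W
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # base(x) + scale * x @ (BA)^T, computed without materializing BA
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16))
out = layer(torch.randn(2, 16))    # base output plus low-rank update
```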
- LoRA AdapterA LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
- QLoRA (Quantized LoRA)QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
- Sampling (in Language Models)Sampling in language models means selecting the next token from the predicted probability distribution instead of always taking the argmax. The decoding rule strongly shapes diversity, coherence, and repetition, which is why temperature, top-k, and top-p matter so much.
- Greedy DecodingGreedy decoding always selects the highest-probability next token at each step. It is simple and deterministic, but it often gets trapped in bland or repetitive continuations because it never explores slightly less probable alternatives that might lead to better sequences.
- Top-k SamplingTop-k sampling truncates the next-token distribution to the \( k \) most probable tokens, renormalizes, and samples from that set. It removes the low-probability tail that often contains junk while still allowing controlled randomness.
- Top-p Sampling (Nucleus Sampling)Top-p, or nucleus, sampling chooses the smallest set of tokens whose cumulative probability exceeds a threshold \( p \), then samples from that adaptive set. Unlike top-k, it expands when the model is uncertain and shrinks when the distribution is sharp.
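A NumPy sketch combining the two truncation rules from the previous two entries (thresholds and logits are illustrative):

```python
import numpy as np

def filter_logits(logits, k=0, p=1.0):
    """Mask logits outside the top-k set and outside the nucleus of mass p."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if k > 0:
        kth_largest = np.sort(logits)[-k]
        logits[logits < kth_largest] = -np.inf
    if p < 1.0:
        order = np.argsort(logits)[::-1]          # descending by logit
        probs = np.exp(logits[order] - logits.max())
        probs /= probs.sum()
        cum = np.cumsum(probs)
        keep = (cum - probs) < p                  # keep until cumulative mass first exceeds p
        logits[order[~keep]] = -np.inf
    return logits

rng = np.random.default_rng(0)
f = filter_logits([2.0, 1.5, 0.3, -1.0, -2.0], k=4, p=0.9)
pr = np.exp(f - f.max())                          # -inf entries get probability 0
pr /= pr.sum()
print(rng.choice(len(pr), p=pr))
```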
- Frequency PenaltyA frequency penalty subtracts an amount proportional to how many times a token has already appeared, lowering its future logit more with each repetition. It encourages lexical diversity without banning reuse entirely, which makes it gentler than hard repetition constraints.
- Presence PenaltyA presence penalty lowers the score of any token that has already appeared, encouraging the model to introduce new words or topics instead of repeating earlier ones. Unlike a frequency penalty, it depends only on whether the token has appeared at least once, not on how many times it appeared.
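A sketch of both penalties in the additive form documented for the OpenAI completions API; other serving stacks use similar but not identical formulas:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0):
    """logit[t] -= presence * 1[t seen] + frequency * count(t)."""
    counts = Counter(generated_ids)
    out = list(logits)
    for tok, c in counts.items():
        out[tok] -= presence_penalty + frequency_penalty * c
    return out

# Token 2 appeared twice, so it is penalized more under the frequency term.
print(apply_penalties([1.0, 1.0, 1.0], [2, 2], presence_penalty=0.5, frequency_penalty=0.3))
```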
- Retrieval-Augmented Generation (RAG)Retrieval-augmented generation adds a retrieval step so the model conditions on external documents at inference time instead of relying only on memorized parameters. It can improve freshness and grounding, but answer quality depends heavily on retrieval recall, ranking, chunking, and how well the model uses the retrieved evidence.
- Semantic SearchSemantic search retrieves results by meaning rather than exact keyword overlap, usually by embedding queries and documents into a vector space and comparing similarity. It handles paraphrases well, but it is often combined with lexical search when exact terms or identifiers matter.
- Model MergingModel merging combines the weights or weight deltas of separately fine-tuned models into one model without full retraining. It can create multitask behavior cheaply, but naive averaging often causes interference unless the merged models share a common base and compatible parameter geometry.
- Model SoupsModel soups average the weights of multiple fine-tuned models that lie in the same low-loss basin, often improving accuracy and robustness without extra inference cost. Unlike an ensemble, a soup is still a single model at serving time.
- SLERP (Spherical Linear Interpolation)SLERP interpolates between two parameter vectors along the great-circle path on a sphere instead of using straight-line interpolation. In model merging it can preserve vector norms and sometimes produce smoother blends than linear interpolation when the two directions differ strongly.
- TIES-MergingTIES-Merging is a model-merging method that reduces interference by trimming small parameter changes, resolving sign conflicts, and merging only updates aligned with an agreed sign. It is designed for cases where separately fine-tuned models disagree on which weights should move and in what direction.
- DARE (Drop And REscale)DARE sparsifies fine-tuning deltas by randomly dropping many update entries and rescaling the rest, preserving most behavior while greatly reducing storage and merge interference. It is commonly used as a preprocessing step for delta compression or model merging rather than as a standalone training method.
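A NumPy sketch of the drop-and-rescale step under the usual convention of rescaling survivors by \( 1/(1-p) \):

```python
import numpy as np

def dare(delta, drop_rate=0.9, rng=None):
    """Drop-And-REscale a fine-tuning delta: zero most entries, rescale the rest."""
    rng = rng or np.random.default_rng()
    mask = rng.random(delta.shape) >= drop_rate        # keep with prob (1 - drop_rate)
    return np.where(mask, delta / (1.0 - drop_rate), 0.0)

rng = np.random.default_rng(0)
delta = rng.normal(size=(4, 4))
print(dare(delta, drop_rate=0.9, rng=rng))             # sparse; survivors scaled 10x
```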
- Model CompressionModel compression reduces a model’s memory, latency, or energy cost while trying to preserve performance. Common compression methods include distillation, pruning, quantization, low-rank factorization, and architecture redesign.
- Knowledge DistillationKnowledge distillation trains a smaller student model to match the outputs or internal representations of a larger teacher. It transfers some of the teacher’s behavior into a cheaper model, often using soft targets that contain more information than hard labels alone.
- PruningPruning removes weights, neurons, heads, or entire blocks that contribute little to performance. It can reduce model size or compute, but aggressive pruning usually needs fine-tuning to recover lost accuracy.
- Structured PruningStructured pruning removes whole channels, heads, layers, or blocks, producing regular sparsity that hardware can exploit directly. It usually yields better real-world speedups than unstructured pruning, though it gives less fine-grained control.
- Unstructured PruningUnstructured pruning zeros individual weights wherever they appear unimportant, creating irregular sparsity patterns. It can achieve high compression ratios, but specialized kernels are often needed to turn that sparsity into real latency gains.
- QuantizationQuantization stores or computes with lower-precision numbers, such as INT8 or 4-bit values, instead of full-precision floats. It reduces memory bandwidth and can speed inference, but accuracy depends on how well the lower-precision representation preserves weights and activations.
- Post-Training Quantization (PTQ)Post-training quantization converts a trained model to lower precision after optimization is finished, usually using a calibration set to estimate activation ranges. It is easy and cheap to apply, but accuracy can drop more than with quantization-aware training at very low bit widths.
- Quantization-Aware Training (QAT)Quantization-aware training simulates low-precision arithmetic during training so the model learns weights that remain accurate after quantization. It usually outperforms post-training quantization at low bit widths, but it adds training cost and implementation complexity.
- Program-Aided Language ModelA program-aided language model uses the LLM to translate a problem into executable code, then lets an interpreter carry out the exact computation. This separates natural-language understanding from symbolic execution and often improves arithmetic or algorithmic reasoning over pure chain-of-thought.
- FlashAttentionFlashAttention is an exact attention algorithm that uses tiling and kernel fusion to minimize reads and writes between GPU HBM and on-chip SRAM. It preserves standard attention outputs while greatly reducing memory traffic, which yields large speed and memory gains on long sequences.
- Data ParallelismData parallelism replicates the model on multiple devices and splits each batch across them, synchronizing gradients after each step. It is the simplest way to scale training throughput, but every device still stores the full model unless sharding is added.
- Model ParallelismModel parallelism splits one model across multiple devices because it is too large or compute-heavy for a single device. The split can happen by layers, tensors, experts, or sequence chunks, trading memory savings for extra communication.
- Pipeline ParallelismPipeline parallelism partitions a model by layers across devices and sends microbatches through the partitions like an assembly line. It reduces per-device memory, but pipeline bubbles and stage imbalance can waste throughput if the schedule is poorly tuned.
- Tensor ParallelismTensor parallelism shards individual large matrix operations across devices, such as splitting weight matrices by rows or columns. It is effective for very large Transformers, but the frequent collectives mean fast interconnects are important.
- Context ParallelismContext parallelism distributes a long sequence across devices so context tokens and their attention-related work are sharded instead of fully replicated. It helps long-context training or inference scale beyond one device, but requires extra communication to preserve exact attention across chunks.
- Fully Sharded Data Parallel (FSDP)Fully Sharded Data Parallel shards model parameters, gradients, and optimizer states across data-parallel workers, gathering full parameters only when needed for computation. It is the PyTorch analogue of ZeRO-style training and makes much larger models fit without custom model-parallel code.
- Model ShardingModel sharding splits a model’s parameters across devices or storage tiers instead of keeping a full copy everywhere. It is a general systems technique used in tensor parallelism, FSDP, offloading, and large-model serving to reduce per-device memory requirements.
- Mixed Precision TrainingMixed-precision training performs most computation in lower precision such as FP16 or bfloat16 while keeping selected quantities in higher precision for stability. It reduces memory use and often increases throughput without much accuracy loss when implemented carefully.
- Long-Context PretrainingLong-context pretraining trains or continues training a model on examples with much longer sequences so it learns to use distant context instead of only fitting short windows. It is usually needed because simply changing positional scaling or the context limit does not teach robust long-range retrieval or reasoning.
- Token IDA token ID is the integer index assigned to a token after tokenization. Models do not operate on raw text directly; they look up embeddings from token IDs and later map output logits back to IDs during decoding.
- Vocabulary SizeVocabulary size is the number of distinct tokens a tokenizer can emit. A larger vocabulary shortens sequences but increases embedding and softmax size, while a smaller vocabulary produces longer sequences and more token fragmentation.
- Subword TokenizationSubword tokenization splits text into frequent pieces smaller than words but larger than individual characters. It handles rare words and open vocabularies well by composing unfamiliar words from known subword units.
- Special TokensSpecial tokens are reserved tokens with structural or control meaning, such as BOS, EOS, PAD, SEP, or mask tokens, rather than ordinary text content. They shape formatting, training objectives, and sometimes model behavior.
- Padding TokenA padding token is a dummy token added so sequences in a batch have equal length. It should be ignored by the loss and usually masked from attention so it does not behave like real context.
- BOS Token (Beginning of Sequence)A BOS token marks the beginning of a sequence and gives the model a consistent start symbol for conditioning generation or encoding. It can help define sequence boundaries and sometimes carries special training semantics.
- EOS Token (End of Sequence)An EOS token marks the end of a sequence and tells the model where generation should stop. During training it teaches sequence termination, and during inference it is one of the main stopping conditions.
- Nearest Neighbor SearchNearest neighbor search finds the stored vectors most similar to a query under a chosen distance or similarity metric. Exact search is simple but expensive at scale, so large systems often use approximate nearest neighbor indexes instead.
- Sparse RetrievalSparse retrieval represents queries and documents with sparse term-based features such as inverted indexes, TF-IDF, or BM25. It excels at exact keywords and rare identifiers, but is weaker than dense retrieval on paraphrases and semantic matching.
- Dense RetrievalDense retrieval represents queries and documents with learned dense embeddings and retrieves by vector similarity. It handles paraphrase and semantic matching better than sparse retrieval, but it can miss exact lexical constraints and usually relies on approximate nearest neighbor search.
- Temperature ScalingTemperature scaling calibrates a classifier by dividing logits by a learned scalar temperature before the softmax. It often improves probability calibration on a validation set without changing the model’s ranking of classes.
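A PyTorch sketch that fits the scalar temperature on held-out logits; the synthetic data below is a placeholder for real validation outputs:

```python
import torch

def fit_temperature(logits, labels, iters=300, lr=0.05):
    """Learn one scalar T > 0 minimizing NLL on held-out logits (a sketch)."""
    log_t = torch.zeros(1, requires_grad=True)   # parameterize T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Synthetic overconfident logits: corrupting some labels makes T > 1 optimal.
torch.manual_seed(0)
labels = torch.randint(0, 5, (500,))
logits = 4.0 * torch.nn.functional.one_hot(labels, 5).float() + torch.randn(500, 5)
labels[:100] = torch.randint(0, 5, (100,))
print(fit_temperature(logits, labels))           # typically > 1 on this data
```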
- Batch NormalizationBatch normalization normalizes activations using mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference because it relies on running statistics.
- Layer NormalizationLayer normalization normalizes activations across features within each example rather than across the batch. It works well for variable-length sequences and small batch sizes, which is why it is standard in Transformers.
- Gradient ClippingGradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
- Weight DecayWeight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but in adaptive optimizers the decoupled AdamW form is usually preferred.
- Adam OptimizerAdam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter’s update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
- RMSPropRMSProp uses an exponential moving average of squared gradients to normalize updates, preventing the denominator from growing without bound as in AdaGrad. It is useful for nonstationary problems and was a key precursor to Adam.
- Learning Rate ScheduleA learning-rate schedule changes the learning rate over training instead of keeping it constant. Schedules matter because they balance fast early progress with stable late optimization and often determine final performance as much as the base optimizer.
- WarmupWarmup starts training with a small learning rate and gradually increases it during the first steps. It reduces early instability, especially in Transformers where large updates before optimizer statistics settle can cause divergence.
- Cosine AnnealingCosine annealing decays the learning rate following a cosine curve from a high value to a low value, sometimes with restarts. It provides a smooth schedule that often works well in practice without needing many hand-tuned decay boundaries.
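A sketch combining linear warmup with cosine decay, as in the two entries above; the specific constants are placeholders:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=1000, min_lr=3e-5):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)  # progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

for s in (0, 999, 1000, 50000, 99999):
    print(s, lr_at(s, total_steps=100_000))
```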
- Label SmoothingLabel smoothing replaces hard one-hot targets with a mostly-correct probability distribution that assigns a small amount of mass to other classes. It regularizes overconfident classifiers and often improves generalization and calibration, though it can hurt when exact probabilities matter.
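A sketch of one common convention, which spreads \( \epsilon \) uniformly over the incorrect classes (another variant spreads it over all classes, including the true one):

```python
import numpy as np

def smooth_labels(num_classes, true_class, eps=0.1):
    """Soften a one-hot target: 1 - eps on the true class, eps shared by the rest."""
    t = np.full(num_classes, eps / (num_classes - 1))
    t[true_class] = 1.0 - eps
    return t

print(smooth_labels(5, true_class=2))   # [0.025, 0.025, 0.9, 0.025, 0.025]
```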
- Tokenization PipelineA tokenization pipeline is the full process that turns raw text into model-ready inputs, including normalization, pre-tokenization, subword splitting, token-to-ID mapping, truncation, padding, and special-token insertion. Choices here directly affect sequence length, vocabulary coverage, and downstream behavior.
- GPU AccelerationGPU acceleration uses highly parallel graphics processors to speed the matrix and tensor operations that dominate modern ML workloads. It matters because deep learning is mostly throughput-bound linear algebra, which GPUs execute far more efficiently than general-purpose CPUs.
- CUDACUDA is NVIDIA’s parallel-computing platform and programming model for running general-purpose kernels on GPUs. In machine learning it is the software layer that makes GPU-accelerated training and inference practical, exposing massive parallelism, specialized libraries, and direct control over device memory.
- Inference OptimizationInference optimization is the set of techniques that reduce serving latency, memory use, and cost while preserving acceptable quality. Common methods include quantization, batching, KV-cache optimizations, kernel fusion, speculative decoding, and architecture choices that trade a little flexibility for much higher throughput.
- Instruction TuningInstruction tuning is supervised fine-tuning on instruction-response examples so a pretrained model learns to follow requests instead of merely continuing text. It improves task generality and usability, but it mainly changes behavior and format-following rather than adding much new world knowledge.
- Safety AlignmentSafety alignment is the process of making a model reliably avoid harmful, deceptive, or policy-violating behavior while remaining useful. In practice it combines data curation, supervised tuning, preference optimization or RLHF, classifiers, and adversarial evaluation, but it never guarantees perfect safety.
- Red-TeamingRed-teaming is adversarial evaluation in which people or automated systems deliberately try to break a model’s safeguards and expose failure modes. Its purpose is not to improve benchmark scores directly, but to find unsafe or brittle behavior before deployment.
- What is a jailbreak in the context of LLMs?In the context of LLMs, a jailbreak is a prompt or interaction pattern that bypasses the model’s safety training or policy enforcement and elicits behavior it was supposed to refuse. Jailbreaks matter because they reveal that aligned behavior can be a thin behavioral layer rather than a deep guarantee.
- Adversarial PromptingAdversarial prompting is the deliberate construction of inputs that push a model toward incorrect, unsafe, or unintended behavior. It includes jailbreaks, prompt injection, data exfiltration attempts, and other attacks that exploit weaknesses in instruction-following or context handling.
- Data AugmentationData augmentation expands a training set with label-preserving transformations such as crops, paraphrases, or noise injection. It improves generalization by teaching the model which variations should not change the answer.
- Classification HeadA classification head is the final task-specific layer that maps learned representations to class logits or probabilities. In transfer learning it is often the only part trained from scratch, while the backbone provides reusable features.
- Softmax HeadA softmax head is the output projection plus softmax normalization that converts hidden representations into a probability distribution over classes or vocabulary items. In language models it is the layer that turns the final hidden state into next-token probabilities.
- Beam SearchBeam search is a decoding algorithm that keeps the top-scoring partial sequences at each step instead of only the single best one. It approximates high-probability generation better than greedy decoding, but it can still miss the global optimum and often reduces diversity.
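A toy sketch of the algorithm; `next_logprobs` is an assumed callback standing in for a real model:

```python
import numpy as np

def beam_search(next_logprobs, beam_size=3, max_len=10, eos_id=0):
    """Keep the beam_size best partial sequences at each step."""
    beams = [([], 0.0)]                               # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids and ids[-1] == eos_id:
                candidates.append((ids, score))       # finished hypotheses carry over
                continue
            lp = next_logprobs(ids)                   # shape (vocab,)
            for tok in np.argsort(lp)[-beam_size:]:   # expand only the top tokens
                candidates.append((ids + [int(tok)], score + float(lp[tok])))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0]

# Stub "model": prefers token 1 for two steps, then EOS (token 0).
stub = lambda ids: np.log(np.array([0.05, 0.9, 0.05]) if len(ids) < 2
                          else np.array([0.9, 0.05, 0.05]))
print(beam_search(stub, beam_size=2, max_len=5))      # ([1, 1, 0], ≈ -0.32)
```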
- Repetition PenaltyA repetition penalty is a decoding heuristic that downweights tokens or phrases the model has already used, reducing loops and bland repetition. It improves generation quality when a model is prone to degeneracy, but too much penalty can make text unnatural or incoherent.
- Logit AdjustmentLogit adjustment means modifying logits to account for effects such as class imbalance, prior shift, or calibration goals before taking probabilities or losses. It changes the decision boundary in a simple way by shifting scores rather than changing the underlying representation.
- Domain-Specific PretrainingDomain-specific pretraining continues or repeats pretraining on corpus data from a specialized domain such as law, medicine, or code. It improves vocabulary use, factual recall, and style in that domain, but it can also narrow the model or erode performance outside the target distribution.
- Distributed Computing (ML Training)Distributed computing in ML training spreads computation, memory, or both across many devices and often many machines. It is what makes modern large-model training possible through strategies such as data parallelism, model parallelism, sharding, and pipeline execution.
- Memory Optimization (ML Training)Memory optimization in ML training is the collection of techniques that reduce peak memory so larger models or batches fit on available hardware. Common examples are mixed precision, activation checkpointing, optimizer sharding, offloading, and more memory-efficient attention kernels.
- Role-Playing (LLMs)Role-playing in LLMs means conditioning a model to adopt a persona, voice, or behavioral frame during generation. It is useful for simulation and product design, but it also shows how easily high-level behavior can be steered by prompt context.
- Data MixtureA data mixture is the weighting and composition of different datasets or domains in a training run. It matters because capability, robustness, and bias often depend as much on what proportion of the data comes from each source as on the total token count.
- Log-Sum-Exp TrickThe log-sum-exp trick computes expressions like \( \log \sum_i \exp(x_i) \) stably by subtracting the maximum logit before exponentiation. It prevents overflow and underflow, so it is a standard numerical tool in softmax, cross-entropy, and probabilistic inference.
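A NumPy sketch showing why the trick matters; the naive form overflows on these inputs:

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): subtract the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                      # ~1002.41
# Naive np.log(np.exp(x).sum()) overflows: exp(1000) is inf in float64.
```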
- k-Nearest Neighbors (k-NN)k-nearest neighbors predicts by finding the k closest labeled training points to a query and then voting or averaging their labels. It has almost no training phase, but its predictions depend heavily on the distance metric and it degrades in high-dimensional spaces.
- YaRN / NTK-aware RoPE ScalingYaRN and other NTK-aware RoPE-scaling methods extend the usable context of RoPE-based models by rescaling or interpolating rotary frequencies rather than retraining the model from scratch. Their goal is to preserve short-context behavior while making long-range positions less distorted.
- Multi-head Latent Attention (MLA)Multi-head Latent Attention compresses the keys and values of multi-head attention into a smaller latent representation before use. Its main advantage is a much smaller KV cache and lower decode-time memory bandwidth, which is why it is attractive for long-context serving.
- Pretraining Data DeduplicationPretraining data deduplication removes near-duplicate documents or passages from a training corpus. It improves per-token efficiency and reduces memorization and benchmark contamination, because repeatedly seeing the same text usually wastes compute more than it adds knowledge.
- vLLM & Continuous BatchingvLLM is an LLM serving system built around PagedAttention and continuous batching. Instead of waiting for a batch to finish, it admits and schedules requests at each decoding step, which reduces padding waste and improves throughput for variable-length generations.
- GPTQ QuantizationA one-shot, layer-wise post-training quantization method that pushes LLM weights to 3–4 bits while preserving generation quality. GPTQ reformulates quantization as a per-row error-minimisation problem solved greedily using inverse-Hessian information from a small calibration set, achieving 3-bit LLaMA-65B with <1% perplexity loss.
- AWQ (Activation-aware Weight Quantization)AWQ is a post-training quantization method that preserves quality by protecting the weights attached to the largest activation channels before rounding. It targets low-bit LLM inference with small accuracy loss and is popular because it is simpler to deploy than Hessian-based methods such as GPTQ.
- xFormers / Memory-Efficient AttentionA library and family of attention implementations that avoid materialising the \( n \times n \) attention matrix, reducing memory from \( O(n^2) \) to \( O(n) \). xFormers bundles FlashAttention, Memory-Efficient Attention (Rabe & Staats), block-sparse variants, and ALiBi/RoPE patches under a unified API — a precursor to the default attention kernels shipped in PyTorch 2.0+.
- KV Cache Compression (H2O, SnapKV)Inference-time methods that shrink a long-context KV cache by evicting tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot pre-fill-time compression.
- Chunked PrefillA serving-time technique that breaks the long prefill of a prompt into small chunks and interleaves them with decode steps of other requests. By keeping GPU utilisation high during prefill and avoiding long tail latencies, chunked prefill dramatically improves throughput in mixed-batch LLM serving.
- Disaggregated Prefill/Decode ServingDisaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
- Paged vs Block KV CacheTwo allocation strategies for an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case length per request and wastes memory. Paged (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× higher batch-size at the cost of kernel-level bookkeeping.
- Tensor Cores & GEMM FundamentalsTensor cores are specialised matrix-multiply units in NVIDIA GPUs (introduced in Volta, 2017) that execute small mixed-precision matrix-multiply-accumulate (MMA) operations each clock cycle. Peak DL throughput is set by tensor-core flops; general matrix multiplications (GEMMs) tiled to tensor-core shapes are how deep learning reaches that ceiling.
- BF16 / FP8 / MXFP4 Number FormatsThese are low-precision number formats used to trade numerical precision for speed and memory efficiency in modern ML hardware. BF16 is the standard training workhorse, FP8 is increasingly used for faster training and inference, and 4-bit floating formats push efficiency further for aggressive inference optimization.
- DeepSpeed ZeRO-Infinity / OffloadingAn extension of ZeRO that offloads optimizer states, gradients, and parameters to CPU RAM and NVMe SSDs, enabling training of trillion-parameter models on modest GPU clusters. ZeRO-Infinity (Rajbhandari et al., 2021) uses bandwidth-aware partitioning and overlap to hide the offload latency.
- Ring Attention / Context Parallel for Long SequencesA distributed-attention algorithm that shards an \( n \)-token sequence across \( P \) devices and computes each attention output via a ring of key-value rotations. Ring Attention (Liu et al., 2023) enables context lengths of millions of tokens on multi-GPU clusters with near-linear scaling.
- Rejection Sampling Fine-Tuning (RFT)An offline alignment method: sample many completions from a base model, keep only those that pass a quality / reward filter, and fine-tune on the survivors. RFT is the simplest way to convert a reward model (or a verifier) into improved policy behaviour, and serves as a strong baseline against DPO / PPO.
- Self-Play Fine-Tuning (SPIN)Chen et al. (2024) cast fine-tuning as a game between the current model and its previous iterate: the new model must distinguish its own generations from demonstrations, improving it without any new human data. SPIN delivers substantial improvements on supervised data alone, bootstrapping from a weak SFT model toward a stronger one.
- Iterative DPOIterative DPO repeats a simple loop: sample responses from the current policy, score or label them, apply a DPO-style update, and repeat. It brings online data collection to preference optimization while keeping DPO's simpler training dynamics compared with PPO-based RLHF.
- Best-of-N Sampling and its ScalingAn inference-time boost: sample \( N \) responses from a language model, score them with a reward model or verifier, and return the best. Quality scales roughly with \( \log N \); scaling laws predict the point at which best-of-\( N \) inference compute matches extra training compute, and motivate distilling best-of-\( N \) behaviour back into the policy via RFT.
- Feature Attribution (Integrated Gradients, SHAP)Post-hoc interpretability methods that attribute a model's prediction to its input features. Integrated Gradients (Sundararajan et al., 2017) integrates gradients along a path from a baseline to the input. SHAP (Lundberg & Lee, 2017) builds on Shapley values from cooperative game theory, giving a unique attribution with axiomatic fairness properties.
- t-SNE (t-Distributed Stochastic Neighbor Embedding)t-SNE embeds high-dimensional points into 2D/3D by matching a heavy-tailed joint \( Q \) in the embedding space to a Gaussian-kernel joint \( P \) over neighbourhoods in the input space, minimising \( D_{\text{KL}}(P\|Q) \). The heavy Student-t tail in \( Q \) solves the crowding problem; perplexity tunes neighbourhood size. t-SNE preserves local structure well but distorts global distances.
- UMAP (Uniform Manifold Approximation and Projection)UMAP constructs a fuzzy topological representation of the data using local Riemannian metrics, then finds a low-dimensional embedding whose fuzzy topology matches. Its cross-entropy objective has distinct attractive and repulsive terms — easier to scale than t-SNE — and a theoretical motivation grounded in category theory and Riemannian geometry. In practice, UMAP preserves more global structure than t-SNE at similar or better speed.
- DBSCAN (Density-Based Clustering)DBSCAN clusters points by density: a point is a core point if it has at least \( \text{minPts} \) neighbours within radius \( \varepsilon \); clusters are connected components of core-point neighbourhoods. Non-core points without core neighbours are labelled noise. Unlike \( k \)-means, DBSCAN does not need the number of clusters in advance and handles arbitrary cluster shapes, at the cost of \( \varepsilon \) tuning.
- Matrix Factorisation and Collaborative FilteringMatrix factorisation decomposes a sparse user–item rating matrix \( R \approx UV^\top \) into low-rank user and item embeddings, learned by minimising squared error on observed entries with regularisation. This is the canonical collaborative-filtering recipe behind Netflix-style recommenders, the spiritual ancestor of modern embedding-based retrieval, and a clean worked example of low-rank learning.
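A NumPy sketch of the classic SGD recipe over observed triples (hyperparameters are illustrative):

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, rank=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    """SGD on observed (user, item, rating) triples for R ≈ U @ V.T."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            u_old = U[u].copy()                   # use pre-update U[u] for the V step
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]
U, V = mf_sgd(ratings, n_users=2, n_items=2)
print(U[1] @ V[1])    # prediction for the unobserved (user 1, item 1) cell
```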
- Deep Q-Network (DQN)DQN parameterises \( Q(s, a; \theta) \) by a CNN and trains it by Q-learning on mini-batches sampled from a replay buffer, with a slowly-updated target network stabilising the bootstrapped target. Mnih et al. (2015) showed a single DQN architecture learning 49 Atari games from raw pixels, reaching human-level performance on many of them — the empirical breakthrough that ignited modern deep RL.
- WordPiece, Unigram, and SentencePiece TokenisationWordPiece, Unigram, and SentencePiece are subword tokenization schemes used to balance vocabulary size against sequence length. WordPiece builds a vocabulary greedily, Unigram prunes a probabilistic vocabulary by likelihood, and SentencePiece is the language-agnostic toolkit that trains and applies these methods on raw text.
- Prefix-Tuning and Prompt-TuningPrompt-tuning and prefix-tuning freeze the base model and learn only a small set of continuous prompt parameters. Prompt-tuning adds learned embeddings at the input, while prefix-tuning adds learned key-value prefixes inside each attention layer.
- Adapter Layers (Houlsby-style)Houlsby-style adapters are small bottleneck MLPs inserted inside each Transformer block while the original backbone is frozen. They were the first clean PEFT recipe: add a few trainable parameters per layer, keep the base model unchanged, and specialize the model to new tasks cheaply.
- GPU Memory Hierarchy and the Roofline ModelThe GPU memory hierarchy ranges from slow, large global memory to faster on-chip caches, shared memory, and registers. The roofline model explains performance by comparing arithmetic intensity with hardware limits: low-intensity kernels are memory-bound, while high-intensity kernels are compute-bound.
- NCCL Collectives: All-Reduce, All-Gather, Reduce-ScatterNCCL (NVIDIA Collective Communications Library) implements ring and tree algorithms for GPU-to-GPU communication primitives — the building blocks of distributed deep learning. All-reduce sums gradients across workers; reduce-scatter + all-gather are the two halves of all-reduce, exploited by ZeRO and FSDP to shard parameters. Understanding these primitives is essential to reasoning about training scalability.
- Arithmetic IntensityArithmetic intensity is the number of floating-point operations performed per byte of data moved from memory. In the roofline model it determines whether a kernel is memory-bound or compute-bound, which is why matmuls are efficient and elementwise ops often are not.
- Triton GPU Kernel ProgrammingTriton (OpenAI, 2019) is a Python-embedded DSL for writing high-performance GPU kernels. It exposes tile-level primitives (block pointers, block-scoped arithmetic, automatic vectorisation) while hiding shared-memory management and thread-level scheduling. FlashAttention-2, Mamba kernels, many PyTorch 2 inductor-generated kernels, and most modern custom ops are written in Triton — fast enough to match hand-tuned CUTLASS with far less code.
- HNSW (Hierarchical Navigable Small World)HNSW is the graph-based approximate nearest-neighbour (ANN) algorithm powering most production vector databases. It maintains a multi-layer proximity graph where each layer is a small-world graph at a different density; search descends from sparse top layers to dense bottom layers by greedy edge-following. Query cost is \( O(\log n) \); recall-latency trade-off is tunable per query.
- ColBERT and Late-Interaction RetrievalColBERT (Khattab & Zaharia, 2020) represents each document as a bag of contextualised token vectors rather than a single vector, and scores against a query via MaxSim: for each query token, find its best-matching document token and sum the similarities. Late interaction preserves token-level granularity that pooling destroys, closing the quality gap between dense retrievers and cross-encoders at a fraction of the cost.
- XGBoost, LightGBM, and CatBoostThree production gradient-boosted-decision-tree implementations with distinct tree-construction strategies: XGBoost does level-wise exact/approximate splits with second-order Taylor objective; LightGBM uses histogram-based leaf-wise growth and GOSS subsampling; CatBoost uses ordered boosting to avoid target leakage with categorical features.
- YOLO Family (v1 – v10)Single-stage detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from a single CNN forward pass. YOLOv1 was real-time but coarse; YOLOv3–v10 progressively adopted anchor boxes, FPN, CSP blocks, decoupled heads, and finally anchor-free / NMS-free designs for edge deployment.
- Lion OptimizerA momentum-only optimizer discovered by automated program search: updates are \( \theta \leftarrow \theta - \eta \, \text{sign}(\beta_1 m + (1-\beta_1) g) \). Uses only the sign of a momentum estimate — no second moment, half the state of AdamW, often matches or beats it on large-scale training with smaller learning rates.
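A sketch of one Lion step following the published update rule; `params`, `momenta`, and `grads` are assumed parallel lists of tensors:

```python
import torch

@torch.no_grad()
def lion_step(params, momenta, grads, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: sign of an interpolated momentum, then a momentum EMA."""
    for p, m, g in zip(params, momenta, grads):
        update = (beta1 * m + (1 - beta1) * g).sign()
        p.add_(update + wd * p, alpha=-lr)        # signed update with decoupled weight decay
        m.mul_(beta2).add_(g, alpha=1 - beta2)    # momentum EMA uses beta2, not beta1

w = torch.randn(3, 3)
state = [torch.zeros_like(w)]
lion_step([w], state, [torch.randn(3, 3)])
```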
- AdaFactor OptimizerA memory-efficient Adam variant. Replaces the full second-moment matrix with row and column running averages (a rank-1 factorisation): memory is \( O(m + n) \) instead of \( O(mn) \). Used by T5, Meta's BlenderBot, and many large-scale models to free memory for bigger batches and longer contexts.
- Focal Loss & Class-Imbalance ObjectivesFocal loss \( \text{FL}(p_t) = -(1-p_t)^\gamma \log p_t \) down-weights the loss contribution of confidently-classified easy examples, focusing gradient on hard ones. Designed for extreme foreground-background imbalance in one-stage detection; widely used in segmentation and long-tailed classification.
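A PyTorch sketch of the multi-class form; the optional class-balancing \( \alpha \) weight is omitted:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return ((1 - pt) ** gamma * -log_pt).mean()   # easy examples (pt near 1) contribute little

logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(focal_loss(logits, targets))
```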
- Gradient Accumulation & Micro-BatchingSplit a large effective batch into \( k \) micro-batches; accumulate their gradients in a buffer, then step once. Decouples statistical batch size from hardware batch size, enabling \( N\cdot k \) effective batch without a \( k\times \) memory blow-up. Essential for training large models on any GPU.
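A minimal PyTorch pattern; the tiny model, loader, and optimizer below are stand-ins so the loop runs end to end:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]

accum_steps = 8                                  # effective batch = 8 micro-batches of 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps    # scale so accumulated grads average correctly
    loss.backward()                              # grads add into .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one optimizer step per effective batch
        optimizer.zero_grad()
```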
- Data Curation & Quality Filters (FineWeb, Dolma)Modern pretraining pipelines filter terabytes of web data through language ID, heuristic rules (repetition, punctuation ratios), classifier-based quality scoring, and toxicity / PII removal. The FineWeb and Dolma recipes document which filters mattered — often delivering per-token quality gains equivalent to 2–3× scale-up.
- MinHash & LSH for Large-Scale DeduplicationMinHash approximates Jaccard similarity between sets, and locality-sensitive hashing uses that approximation to quickly find near-duplicates. In ML data pipelines this is used to deduplicate documents or examples at very large scale.
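A small sketch of MinHash signatures, using salted hashes as stand-ins for independent hash functions:

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """For each of num_perm hash functions, keep the minimum hash over the token set."""
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + t.encode(), digest_size=8).digest(), "big")
            for t in tokens))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"the", "quick", "brown", "fox"})
b = minhash_signature({"the", "quick", "brown", "dog"})
print(est_jaccard(a, b))   # noisy estimate of the true Jaccard, 3/5
```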
- Synthetic Data Generation for Post-TrainingModern instruction-tuning and RL-based alignment rely on LLM-generated synthetic data: self-instruct / Evol-Instruct expand seed prompts, teacher models produce high-quality completions, and process-reward models validate chain-of-thought steps. The backbone of the post-ChatGPT post-training stack.
- Continual Pretraining & Mid-TrainingContinue pretraining an existing base model on a domain or task-focused corpus (code, math, a new language) before final post-training. Achieves domain gains that would cost 10× more to obtain by fine-tuning alone. Sits between pretraining and SFT in modern recipes.
- Long-Context Data Recipes (RULER, Needle Variants)Extending effective context beyond 128k requires (a) RoPE-scaling or position-interpolation to keep positional encodings sane, (b) a continued-pretraining dataset with real long documents and synthetic stitched tasks, and (c) evaluation beyond simple needle-in-a-haystack — RULER adds multi-needle, multi-hop, and aggregation subtasks that expose superficial-match shortcuts.
- Tokenizer Training Dynamics & Vocab SizingTokenizer design trades vocabulary size against sequence length and changes both compute cost and what patterns the model can represent cleanly. BPE, unigram, and byte-level schemes make different compromises, especially for code, multilingual text, and rare domain terms.
- ORPO (Odds-Ratio Preference Optimization)A reference-model-free preference-optimisation objective that combines SFT and preference learning in one loss: \( \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \lambda \log \sigma(\log \text{odds}(y_w) - \log \text{odds}(y_l)) \). Eliminates DPO's reference-model requirement, halving training memory.
- SimPO (Simple Preference Optimization)Reference-free preference objective that replaces DPO's reference ratio with a length-normalised policy log-likelihood: \( r(x, y) = \log \pi_\theta(y\mid x) / |y| \). Adds a margin \( \gamma \) to the preferred response. Simpler than DPO, matches or beats it, and length-normalisation reduces verbosity exploitation.
- Constrained Decoding: Grammars, JSON Mode, RegexGrammar-constrained decoding masks any next token that would violate a target grammar or schema. This guarantees outputs such as valid JSON, XML, or regex-matching strings by restricting generation to the language accepted by a finite-state machine or pushdown automaton.
- Model Context Protocol (MCP)Model Context Protocol is an open client-server protocol for connecting models and agents to external tools, resources, and prompts through a common interface. It standardizes capability discovery and tool invocation so one MCP-aware client can talk to many different servers without bespoke integrations.
- FlashAttention-2 and FlashAttention-3FlashAttention-2 and FlashAttention-3 are follow-on attention kernels that keep exact attention outputs while running much faster through better tiling, parallelism, and data movement. FA-2 improves work partitioning on modern GPUs, while FA-3 adds Hopper-specific asynchronous pipelines and low-precision support.
- Medusa & EAGLE Speculative-Decoding HeadsSpeculative decoding with learned draft heads instead of a separate draft model. Medusa adds \( K \) lightweight heads to the base model that predict future tokens; the base model verifies the resulting tree of hypotheses in one forward pass. EAGLE models the residual stream directly and achieves 3–4× speedup with a tiny draft network.
- Lookahead DecodingExact speculative-like decoding without any draft model: maintain a running n-gram 'Jacobi window' that proposes multiple tokens ahead in parallel; verify in one pass. Lossless — outputs match greedy decoding exactly — and requires no extra training. Trades increased per-step compute for fewer sequential steps.
- Radix / Prefix-Cache Attention (SGLang)Share the KV cache across requests that start with a common prompt prefix. Store prefix trees keyed by token sequence; on a new request, find the longest matching prefix in the cache and reuse it. Cuts prefill latency and memory use for chat applications with shared system prompts or few-shot contexts.
- Quantized KV Cache (int4 / int8 / KIVI)Store the KV cache at lower precision — int8 or int4 — instead of fp16. Halves or quarters the memory footprint of long contexts at negligible quality cost. Different quantisation per key / value (K usually int8, V int4 via grouping) and per-head asymmetric scales are the main tricks.
- Continuous vs Static BatchingStatic batching groups requests before a forward pass and runs them to completion together — tail latency is set by the slowest request. Continuous batching (Orca, vLLM) evicts finished requests mid-step and admits new ones each iteration, keeping GPU utilisation high and tail latency bounded. Default in production LLM serving.
- Federated Learning (FedAvg)Train a shared model across many clients (phones, hospitals) without centralising data. Each round: clients train locally for a few epochs, server averages their weight updates. Introduces non-IID-data, communication-cost, and privacy / security challenges absent in centralised training.
- Gradient Noise ScaleA scalar diagnostic that estimates the largest useful batch size by comparing the variance of per-example gradients to the squared norm of the mean gradient: \( B_{\text{noise}} = \operatorname{tr}(\Sigma)/\|g\|^2 \). McCandlish et al. (2018) argue that returns to scaling batch size diminish sharply once \( B \gg B_{\text{noise}} \), giving a principled way to choose batch size during large-scale training.
- Adaptive Gradient Clipping (AGC)A per-parameter clipping rule introduced by Brock et al. (2021) that bounds each weight's update by a fraction of the weight's own norm: clip \( g \) so \( \|g\|/\|w\| \le \lambda \). Unlike global-norm clipping, AGC scales naturally with parameter magnitude and made it possible to train Normalizer-Free Networks (NFNets) without batch normalisation while matching its training stability.
- Self-Paced LearningA curriculum-learning variant where the model itself decides which examples are 'easy enough' to train on at the current step, by minimising a joint objective \( \sum_i v_i \ell_i - \lambda \sum_i v_i \) over both parameters \( \theta \) and per-example weights \( v_i \in \{0,1\} \) (or \( [0,1] \)). Kumar, Packer & Koller (2010) introduced it as a non-convex EM-style alternative to handcrafted curricula; \( \lambda \) is annealed from low (only easy examples) to high (all examples).
- Loss Landscape VisualizationMethods for visualising high-dimensional loss surfaces by projecting parameters onto 1-D or 2-D subspaces. Goodfellow's linear interpolation (2014) plots loss along the line between two solutions; Li et al.'s filter normalisation (2018) plots loss in a 2-D plane spanned by random Gaussian directions normalised per-filter. The latter reveals that residual connections smooth the landscape and that flat minima correspond to wide bowls in the visualisation.
- Gradient Surgery (PCGrad) for Multi-Task LearningWhen two task gradients in multi-task learning point in conflicting directions (negative cosine), they partially cancel each other and slow learning. Yu et al.'s PCGrad (2020) projects each task gradient onto the normal plane of any conflicting task's gradient before summing, removing the destructive component. This 'gradient surgery' restores monotone progress on both tasks at modest cost.
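A NumPy sketch of the two-task projection step (the full method iterates over tasks in random order):

```python
import numpy as np

def pcgrad_pair(g1, g2):
    """PCGrad for two tasks: project each gradient off the other's conflicting direction."""
    def project(a, b):
        dot = a @ b
        if dot < 0:                          # conflict: negative cosine similarity
            a = a - (dot / (b @ b)) * b      # remove the destructive component
        return a
    return project(g1.copy(), g2), project(g2.copy(), g1)

g1, g2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
p1, p2 = pcgrad_pair(g1, g2)
print(p1 @ g2, p2 @ g1)                      # both 0: conflicts removed
print(p1 + p2)                               # combined update after surgery
```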
- Non-Contrastive SSL (BYOL, SimSiam, Barlow Twins, VICReg)Non-contrastive self-supervised learning learns representations from multiple views of the same example without explicit negative pairs. Methods such as BYOL, SimSiam, Barlow Twins, and VICReg avoid collapse using asymmetry, stop-gradient, redundancy reduction, or variance-preserving terms instead of contrastive negatives.
- Sparse Representations in Deep NetsA sparse representation is one where only a small fraction of units are active for any given input. Deep nets often develop sparsity through ReLU-like nonlinearities or explicit penalties, which can improve efficiency, feature selectivity, and sometimes interpretability.
- Metric Learning at ScaleMetric learning at scale trains embeddings so similar items are close and dissimilar items are far apart even when the dataset is too large for naive pairwise or triplet mining. The main challenge is finding informative negatives and keeping computation manageable as the corpus grows.
- Memory in LLMsAn LLM's working memory is the KV-cache for the current context window; longer-term memory is implemented externally by retrieval over vector / hybrid stores, by writing to scratchpads or tool state, or by parameter updates (continual fine-tuning). Each option has a different latency, capacity, and forgetting profile; production systems combine all three.
- Compressed Sparse Attention (CSA)Compressed Sparse Attention (CSA) is a long-context attention scheme that first compresses the KV cache into block summaries and then performs sparse attention only over the top-k relevant compressed blocks. An added sliding-window branch preserves exact local dependencies, so CSA cuts both KV memory and long-context attention compute without collapsing into a purely local window.
- Distributed Training (Data & Model Parallelism)Distributed training scales learning across many devices by splitting either the data, the model, or both. Data parallelism is simplest when the model fits on each device, while model parallel approaches such as tensor and pipeline parallelism are needed when parameters or activations are too large for one accelerator.
- Efficient Inference (Distillation, Pruning)Efficient inference reduces latency, memory, or serving cost without retraining a model from scratch, often by distilling a smaller student or pruning unimportant weights and structures. Distillation transfers behavior; pruning removes computation, and the two are often combined with quantization for deployment-grade compression.
- Train/Validation/Test SplitA train/validation/test split separates fitting, model selection, and final evaluation into different datasets. The test set is kept untouched until the end so it remains a credible estimate of out-of-sample performance.
- k-Fold Cross-Validationk-fold cross-validation rotates a held-out fold through the dataset so every example is used for validation once and training the other times. It uses limited data efficiently for model selection, but it costs multiple training runs and must keep preprocessing inside each fold.
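A scikit-learn sketch showing the "preprocessing inside each fold" point via a pipeline (data is a random placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# The scaler lives inside the pipeline, so it is refit on each fold's training
# portion only; validation data never leaks into the preprocessing statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```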
- Feature Scaling and StandardizationFeature scaling rescales input dimensions to comparable magnitudes, while standardization specifically subtracts the training mean and divides by the training standard deviation. It matters because optimization, distances, and margins can otherwise be dominated by whichever feature uses the largest units.
- Class Imbalance and ReweightingClass imbalance means some labels are much rarer than others, so an unweighted objective can be dominated by the majority class. Reweighting changes the loss or sampling scheme so rare classes exert more influence during training.
- Target Leakage vs. Data LeakageData leakage is any contamination that lets training or validation use information that would not be available at prediction time. Target leakage is the specific case where features encode the label or a post-outcome proxy for it, so every target leakage problem is data leakage, but not every data leakage problem is target leakage.
- Ablation Studies and Experimental ControlAn ablation study removes or alters one component of a system to measure how much that component actually contributes. Experimental control matters because an ablation is only informative when the comparison keeps everything else fixed, including data, tuning budget, and evaluation protocol.