Tag: path-scaling-advanced
223 topic(s)
- Multimodal Secure Alignment: Multimodal secure alignment is the problem of making a model's safety behavior consistent across text, images, audio, and mixed-modal inputs. It matters because a model can reconstruct harmful intent across modalities or through images that evade text-only filters, so defenses must align the fused system rather than just one input channel.
- Layer Dropping and Progressive Pruning (TrimLLM): Layer dropping and progressive pruning reduce inference cost by cutting transformer depth rather than shrinking every matrix. TrimLLM does this progressively for domain-specialized LLMs, exploiting the empirical fact that not all layers are equally important in a target domain and aiming to retain in-domain accuracy while reducing latency.
- Constitutional Classifiers++: Constitutional Classifiers++ is a production-oriented jailbreak defense that uses context-aware classifiers and a cascade of cheap and expensive checks to block harmful exchanges efficiently. The system is designed to keep refusal rates and serving cost low while still catching universal jailbreaks that earlier, response-only filters missed.
- Continuous Thought Machines (CTM): Continuous Thought Machines are models that make neural timing and synchronization part of the representation, instead of treating layers as purely instantaneous mappings. They use neuron-level temporal processing and support adaptive compute, so the same model can stop early on easy inputs or continue reasoning on harder ones.
- Mechanistic OOCR Steering Vectors: Mechanistic OOCR steering vectors are a proposed explanation for some out-of-context reasoning results: fine-tuning can act like adding an approximately constant steering direction to the residual stream, rather than learning a deeply conditional new algorithm. That helps explain why a tuned behavior can generalize far beyond the fine-tuning data and why injecting or subtracting the vector can often reproduce or remove it.
- Critical Representation Fine-Tuning (CRFT): Critical Representation Fine-Tuning (CRFT) is a PEFT method that improves reasoning by editing a small set of causally important hidden states instead of updating model weights broadly. It identifies critical representations through information-flow analysis and learns low-rank interventions on those states while keeping the base model frozen.
- Chain-of-Thought Monitorability: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- ZeRO (Zero Redundancy Optimizer): ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and eventually parameters across data-parallel workers so each GPU no longer stores a full copy of the training state. This cuts memory dramatically and makes very large-model training feasible without requiring full model-parallel architectures.
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- Sparsely-Gated Mixture of Experts (MoE): A sparsely-gated Mixture of Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic.
- Neural Turing Machine (NTM): A Neural Turing Machine augments a neural controller with a differentiable external memory that it can read from and write to using soft attention over memory locations. It was an early attempt to learn algorithm-like behavior such as copying and sorting while remaining trainable end to end.
- Weight Tying: Weight tying uses the same matrix for token embeddings and the output softmax projection, typically by setting the output weights to the transpose of the input embedding table. This cuts parameters and often improves language modeling by forcing input and output token representations to share geometry.
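A minimal sketch of the shared-matrix idea, with toy sizes and a hand-built embedding table (all values illustrative): one matrix serves as the input lookup table and, transposed, as the output projection that produces logits.

```python
# Weight tying sketch: one matrix E is both the embedding table (row lookup)
# and, transposed, the output projection (logit = hidden . E[i]).
V, d = 5, 3  # toy vocabulary size and hidden width

# Toy embedding table: row i is the embedding of token i.
E = [[0.1 * (i + j) for j in range(d)] for i in range(V)]

def embed(token_id):
    # Input side: look up the token's embedding row.
    return E[token_id]

def logits(hidden):
    # Output side: reuse E transposed, so logit[i] = hidden . E[i].
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

h = embed(2)
out = logits(h)  # one logit per vocabulary entry, no separate output matrix
```

The parameter saving is exactly one V-by-d matrix, which is significant when V is tens of thousands of tokens.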
- Gradient Checkpointing (Activation Recomputation): Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- PagedAttention: PagedAttention stores the KV cache in fixed-size non-contiguous blocks, like virtual-memory pages, instead of requiring one contiguous allocation per sequence. This largely removes fragmentation, enables prompt-prefix sharing, and is a key reason vLLM can serve many more concurrent requests.
- Speculative Decoding: Speculative decoding speeds up autoregressive generation by letting a small draft model propose several tokens and then having the large target model verify them in parallel. With the rejection-sampling correction from the original algorithm, the output distribution remains exactly the same as sampling from the target model alone.
- KL-Divergence Penalty in RLHF: The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the objective. This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
- Proximal Policy Optimization (PPO): Proximal Policy Optimization is a policy-gradient algorithm that improves a policy while clipping how far action probabilities can move from the previous policy in one update. In RLHF it is usually paired with a KL penalty so the model gains reward without drifting too far from a reference model.
- AdamW Optimizer: AdamW is Adam with decoupled weight decay: parameter shrinkage is applied directly to the weights instead of being mixed into the adaptive gradient update. This preserves the intended regularization effect and is why AdamW became the default optimizer for many Transformer models.
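One AdamW step for a single scalar parameter can be sketched as below; hyperparameter names follow the usual convention and the values are illustrative, not prescriptive. The key line is the final one, where decay multiplies the weight directly instead of being added to the gradient.

```python
import math

# One AdamW update for a scalar parameter w, given gradient g,
# moment estimates m and v, and step count t (starting at 1).
def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
    w = w - lr * weight_decay * w            # decoupled decay, applied to w itself
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
```

In plain Adam with L2 regularization, the decay term would instead be folded into g and then rescaled by the adaptive denominator, which weakens regularization on parameters with large gradient variance.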
- Byte Pair Encoding (BPE): Byte Pair Encoding is a subword tokenization method that repeatedly merges the most frequent adjacent symbols in a corpus. It builds a vocabulary between characters and whole words, which handles rare words better than word-level tokenization while keeping sequence lengths manageable.
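The merge loop can be sketched in a few lines. This is a toy version over a hand-made word-frequency table (the words and counts are invented for illustration); real BPE implementations also track merge ranks and an end-of-word marker.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus as {word-as-symbol-tuple: frequency}.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5}
merges = []
for _ in range(2):
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    merges.append(pair)
    corpus = merge_pair(pair, corpus)
# First "u"+"g" merges (20 occurrences), then "h"+"ug".
```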
- Softmax Temperature: Softmax temperature rescales logits before softmax to control randomness in the output distribution. Lower temperature makes probabilities sharper and decoding more deterministic, while higher temperature flattens the distribution and increases diversity.
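A short sketch of the rescaling, with illustrative logit values:

```python
import math

# Temperature-scaled softmax: divide logits by T before normalizing.
def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # peaked: top token dominates
flat = softmax_with_temperature(logits, 2.0)   # flattened: more diversity
```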
- Key-Value (KV) Caching: Key-value caching stores the attention keys and values from earlier tokens during autoregressive decoding so they do not need to be recomputed at every step. It speeds up generation dramatically, but the cache grows with sequence length and turns inference into a memory-management problem.
- K-Means Objective Function (Inertia): The K-means objective, also called inertia, is the sum of squared distances from each point to its assigned cluster centroid. K-means greedily lowers that objective by alternating between reassigning points and recomputing centroids, though the result still depends on initialization because the problem is nonconvex.
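The alternating scheme can be sketched on a toy 1-D dataset (points and initialization invented for illustration, k = 2):

```python
# Toy 1-D K-means: alternate assignment and centroid updates, then
# measure the inertia objective.
def assign(points, centroids):
    # Each point goes to its nearest centroid (squared distance).
    return [min(range(len(centroids)),
                key=lambda j: (p - centroids[j]) ** 2) for p in points]

def update(points, labels, k):
    # Each centroid moves to the mean of its assigned points.
    return [sum(p for p, l in zip(points, labels) if l == j) /
            max(1, sum(1 for l in labels if l == j)) for j in range(k)]

def inertia(points, labels, centroids):
    # Sum of squared distances to assigned centroids.
    return sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))

points = [0.0, 1.0, 9.0, 10.0]
centroids = [0.0, 1.0]  # deliberately poor initialization
for _ in range(10):
    labels = assign(points, centroids)
    centroids = update(points, labels, 2)
# Converges to centroids 0.5 and 9.5 with inertia 1.0 on this data.
```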
- Dropout: Dropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer.
- Mini-batch gradient descent: Mini-batch gradient descent estimates the gradient on a small subset of training examples at each update instead of on the full dataset or a single example. It is the practical default in deep learning because it balances hardware efficiency with optimization noise.
- Early stopping: Early stopping regularizes training by halting optimization when validation performance stops improving and keeping the best checkpoint seen so far. It works because prolonged optimization can eventually fit noise or idiosyncrasies of the training set rather than signal.
- Hyperparameter: A hyperparameter is a setting chosen outside the optimization loop, such as learning rate, model width, regularization strength, or batch size. Unlike learned parameters, hyperparameters govern how the model is trained or structured and are usually selected by validation.
- Data leakage: Data leakage occurs when information that would not be available at prediction time leaks into training or model selection, causing overly optimistic evaluation. Common examples are fitting preprocessing on the full dataset, peeking at test labels, or using future information in time-series tasks.
- Learning rate: The learning rate is the scalar that sets how large each optimization step is when parameters are updated. If it is too high training can diverge or oscillate, and if it is too low training can become extremely slow or get stuck.
- Epoch: An epoch is one complete pass through the training dataset. In mini-batch training, an epoch consists of many updates, one for each batch needed to cover the data once.
- xLSTM: xLSTM is a family of modern LSTM variants that adds exponential gating and redesigned memory structures, including scalar-memory and matrix-memory forms, to make recurrent models more scalable. The goal is to keep LSTM-style recurrence while improving stability, parallelism, and long-context performance.
- minLSTM: minLSTM is a simplified LSTM variant designed to remove some of the sequential dependencies that make classical LSTMs expensive while keeping useful gating behavior. The result is a lighter recurrent block that can be trained more efficiently and scaled more easily.
- Tokenization: Tokenization is the process of splitting raw text into model-readable tokens such as words, subwords, bytes, or characters. It determines vocabulary size, sequence length, and how efficiently a language model handles rare words, multilingual text, and code.
- Subword: A subword is a token unit smaller than a full word but larger than a character, learned to balance vocabulary size against sequence length. Subword tokenization lets models handle rare and novel words by composing them from reusable pieces.
- Vocabulary: A vocabulary is the fixed set of tokens a tokenizer can map text into and a model can natively process. Its size trades off compression against flexibility: larger vocabularies shorten sequences, while smaller ones rely more on subword or byte composition.
- Chat language model: A chat language model is a pretrained LLM further tuned to follow instructions and handle multi-turn dialogue. It is usually built by supervised fine-tuning plus preference optimization or RLHF, so it behaves more helpfully and safely than the raw base model.
- Attention Mask: An attention mask is a tensor that tells an attention layer which positions may interact and which must be blocked. It is used for causal generation, padding suppression, and task-specific visibility patterns, and it must be applied before softmax, not after.
- Grouped-Query Attention (GQA): Grouped-query attention shares key and value heads across groups of query heads, reducing KV-cache size and bandwidth during inference. It sits between full multi-head attention and multi-query attention, preserving most quality while making long-context serving cheaper.
- Root Mean Square Normalization (RMSNorm): RMSNorm normalizes activations by their root mean square without subtracting the mean. Compared with LayerNorm it is slightly cheaper and often just as effective, which is why many modern LLMs use RMSNorm in place of full mean-and-variance normalization.
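The whole operation fits in a few lines; this sketch uses a learned per-feature gain initialized to ones and an illustrative epsilon:

```python
import math

# RMSNorm: divide by the root mean square of the features (no mean
# subtraction), then apply a learned per-feature gain.
def rmsnorm(x, gain, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

x = [3.0, -4.0]            # RMS = sqrt((9 + 16) / 2) ~ 3.536
y = rmsnorm(x, [1.0, 1.0])  # output has RMS ~ 1
```

Skipping the mean subtraction is what distinguishes this from LayerNorm and saves one reduction over the feature dimension.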
- System Prompt: A system prompt is a high-priority instruction block that defines the assistant's role, rules, and behavioral constraints for a conversation. It is usually prepended invisibly to user messages and is intended to override lower-priority prompt content.
- Prompting Format: Prompting format is the template used to serialize instructions, roles, examples, and conversation turns into the token sequence a model expects. It matters because the same words in a different format can change model behavior, especially for chat-tuned systems.
- Prompt Engineering: Prompt engineering is the practice of designing prompts that make a model reliably produce the desired behavior. It includes choosing instructions, examples, structure, and reasoning scaffolds, and it trades parameter updates for careful interface design.
- Tree of Thought: Tree of Thought extends chain-of-thought by exploring multiple candidate reasoning paths, evaluating intermediate states, and searching over them with strategies such as BFS or DFS. It is useful when solving the task requires branching, backtracking, or comparing alternative partial plans.
- Function Calling: Function calling is a language-model capability for producing structured tool invocations instead of only plain text. The model selects a function and arguments that match a schema, which makes tool use more reliable and easier to integrate with software systems.
- Supervised Fine-Tuning (SFT): Supervised fine-tuning trains a pretrained model on curated input-output pairs so it follows instructions, styles, or task formats more reliably. In chat systems, SFT is the stage that turns a raw completion model into an assistant before preference alignment is applied.
- Finetuning: Finetuning continues training a pretrained model on a smaller task-specific or domain-specific dataset. It adapts existing representations rather than learning from scratch, which is why it usually needs far less data and compute than pretraining.
- Full Fine-Tune: A full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
- Parameter-Efficient Fine-Tuning (PEFT): PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
- Low-Rank Adaptation (LoRA): LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
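The factored update can be sketched with toy pure-Python matrices (sizes and values are illustrative, rank r = 1 for clarity): only B and A would receive gradients, while W stays frozen.

```python
# LoRA sketch: effective weight W_eff = W + B @ A, with B (d_out x r)
# and A (r x d_in) for rank r much smaller than the matrix dimensions.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

d_out, d_in, r = 4, 4, 1  # rank-1 update for illustration

W = [[1.0 if i == j else 0.0 for j in range(d_in)]
     for i in range(d_out)]              # frozen base weight (identity here)
B = [[0.1] for _ in range(d_out)]        # d_out x r, trainable
A = [[0.2, 0.0, 0.0, 0.0]]               # r x d_in, trainable

delta = matmul(B, A)                     # the low-rank update BA
W_eff = [[w + dw for w, dw in zip(wr, dr)]
         for wr, dr in zip(W, delta)]

# Trainable parameters: d_out*r + r*d_in = 8, versus 16 for a full update.
```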
- LoRA Adapter: A LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
- QLoRA (Quantized LoRA): QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
- Sampling (in Language Models): Sampling in language models means selecting the next token from the predicted probability distribution instead of always taking the argmax. The decoding rule strongly shapes diversity, coherence, and repetition, which is why temperature, top-k, and top-p matter so much.
- Greedy Decoding: Greedy decoding always selects the highest-probability next token at each step. It is simple and deterministic, but it often gets trapped in bland or repetitive continuations because it never explores slightly less probable alternatives that might lead to better sequences.
- Top-k Sampling: Top-k sampling truncates the next-token distribution to the \( k \) most probable tokens, renormalizes, and samples from that set. It removes the low-probability tail that often contains junk while still allowing controlled randomness.
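A minimal sketch with an invented four-token distribution: truncate to the top k, renormalize implicitly by sampling within the kept mass.

```python
import random

# Top-k sampling: keep the k most probable tokens, sample in proportion
# to their (renormalized) probabilities.
def top_k_sample(probs, k, rng):
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)        # renormalization mass
    r, acc = rng.random() * total, 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]

probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token distribution
rng = random.Random(0)
samples = [top_k_sample(probs, k=2, rng=rng) for _ in range(100)]
# With k=2, only tokens 0 and 1 can ever be sampled.
```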
- Top-p Sampling (Nucleus Sampling): Top-p, or nucleus, sampling chooses the smallest set of tokens whose cumulative probability exceeds a threshold \( p \), then samples from that adaptive set. Unlike top-k, it expands when the model is uncertain and shrinks when the distribution is sharp.
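The adaptive-set behavior is easy to see in a sketch: the same threshold keeps one token under a sharp distribution but nearly all tokens under a flat one (distributions invented for illustration).

```python
# Nucleus (top-p) candidate selection: smallest prefix of the sorted
# distribution whose cumulative mass reaches p.
def nucleus(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:          # stop at the smallest sufficient set
            break
    return kept

sharp = [0.90, 0.05, 0.03, 0.02]  # confident model: nucleus is tiny
flat = [0.30, 0.28, 0.22, 0.20]   # uncertain model: nucleus expands
```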
- Frequency Penalty: A frequency penalty subtracts an amount proportional to how many times a token has already appeared, lowering its future logit more with each repetition. It encourages lexical diversity without banning reuse entirely, which makes it gentler than hard repetition constraints.
- Presence Penalty: A presence penalty lowers the score of any token that has already appeared, encouraging the model to introduce new words or topics instead of repeating earlier ones. Unlike a frequency penalty, it depends only on whether the token has appeared at least once, not on how many times it appeared.
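Both penalties can be sketched together; the coefficients here are hypothetical, chosen only to show that the frequency term grows with each repetition while the presence term is a flat once-seen deduction.

```python
from collections import Counter

# Apply frequency and presence penalties to logits, in the style of the
# sampling parameters exposed by common LLM serving APIs.
def penalize(logits, generated, freq_penalty=0.5, pres_penalty=0.8):
    counts = Counter(generated)
    out = []
    for tok, logit in enumerate(logits):
        c = counts.get(tok, 0)
        logit -= freq_penalty * c                 # scales with repeat count
        logit -= pres_penalty * (1 if c else 0)   # flat, once-seen
        out.append(logit)
    return out

logits = [2.0, 2.0, 2.0]                 # equal scores before penalties
adjusted = penalize(logits, generated=[0, 0, 1])
# Token 0 (seen twice) is penalized most, token 2 (unseen) not at all.
```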
- Misalignment: Misalignment is the failure mode where optimizing a model for its training objective or proxy reward does not produce the behavior humans actually want. It includes problems like reward hacking, unsafe shortcuts, and goal pursuit that diverges from the intended specification.
- Retrieval-Augmented Generation (RAG): Retrieval-augmented generation adds a retrieval step so the model conditions on external documents at inference time instead of relying only on memorized parameters. It can improve freshness and grounding, but answer quality depends heavily on retrieval recall, ranking, chunking, and how well the model uses the retrieved evidence.
- Semantic Search: Semantic search retrieves results by meaning rather than exact keyword overlap, usually by embedding queries and documents into a vector space and comparing similarity. It handles paraphrases well, but it is often combined with lexical search when exact terms or identifiers matter.
- Sparse Mixture-of-Experts (MoE) Layer: A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset such as top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
- Router Network: A router network scores experts or computation paths for each token and decides where that token should be sent in a conditional-compute model such as an MoE. A good router improves specialization while avoiding collapsed routing, overload, and excessive communication.
- Expert Network: An expert network is one of the specialized submodules inside an MoE layer that processes only the tokens routed to it. Experts usually share the same architecture but learn different functions, so specialization emerges from routing plus load-balancing constraints.
- Top-k Routing: Top-k routing sends each token only to the k highest-scoring experts instead of to every expert. This makes MoE computation sparse and efficient, but the choice of k trades off compute cost, robustness, and routing stability.
- Load Balancing (MoE): Load balancing in MoE training adds losses or routing constraints so tokens are spread across experts instead of collapsing onto a few popular ones. It matters because uneven routing wastes capacity, creates bottlenecks, and leaves underused experts poorly trained.
- Switch Transformer: Switch Transformer is a simplified MoE Transformer that routes each token to exactly one expert in each sparse feed-forward layer. Top-1 routing reduces communication and implementation complexity, enabling very large sparse models, but makes router stability and load balancing especially important.
- Model Merging: Model merging combines the weights or weight deltas of separately fine-tuned models into one model without full retraining. It can create multitask behavior cheaply, but naive averaging often causes interference unless the merged models share a common base and compatible parameter geometry.
- Model Soups: Model soups average the weights of multiple fine-tuned models that lie in the same low-loss basin, often improving accuracy and robustness without extra inference cost. Unlike an ensemble, a soup is still a single model at serving time.
- SLERP (Spherical Linear Interpolation): SLERP interpolates between two parameter vectors along the great-circle path on a sphere instead of using straight-line interpolation. In model merging it can preserve vector norms and sometimes produce smoother blends than linear interpolation when the two directions differ strongly.
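The norm-preserving property shows up clearly in a toy 2-D example: the SLERP midpoint of two orthogonal unit vectors stays on the unit circle, whereas the linear midpoint [0.5, 0.5] has norm about 0.71.

```python
import math

# SLERP between two vectors a and b at interpolation fraction t,
# falling back to linear interpolation when they are nearly parallel.
def slerp(a, b, t, eps=1e-8):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    cos = max(-1.0, min(1.0, cos))        # guard against rounding
    theta = math.acos(cos)                # angle between the vectors
    if theta < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

a, b = [1.0, 0.0], [0.0, 1.0]  # toy "parameter vectors"
mid = slerp(a, b, 0.5)          # stays on the unit circle
```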
- TIES-MergingTIES-Merging is a model-merging method that reduces interference by trimming small parameter changes, resolving sign conflicts, and merging only updates aligned with an agreed sign. It is designed for cases where separately fine-tuned models disagree on which weights should move and in what direction.
- DARE (Drop And REscale)DARE sparsifies fine-tuning deltas by randomly dropping many update entries and rescaling the rest, preserving most behavior while greatly reducing storage and merge interference. It is commonly used as a preprocessing step for delta compression or model merging rather than as a standalone training method.
- Model CompressionModel compression reduces a model’s memory, latency, or energy cost while trying to preserve performance. Common compression methods include distillation, pruning, quantization, low-rank factorization, and architecture redesign.
- Knowledge DistillationKnowledge distillation trains a smaller student model to match the outputs or internal representations of a larger teacher. It transfers some of the teacher’s behavior into a cheaper model, often using soft targets that contain more information than hard labels alone.
- PruningPruning removes weights, neurons, heads, or entire blocks that contribute little to performance. It can reduce model size or compute, but aggressive pruning usually needs fine-tuning to recover lost accuracy.
- Structured PruningStructured pruning removes whole channels, heads, layers, or blocks, producing regular sparsity that hardware can exploit directly. It usually yields better real-world speedups than unstructured pruning, though it gives less fine-grained control.
- Unstructured PruningUnstructured pruning zeros individual weights wherever they appear unimportant, creating irregular sparsity patterns. It can achieve high compression ratios, but specialized kernels are often needed to turn that sparsity into real latency gains.
- QuantizationQuantization stores or computes with lower-precision numbers, such as INT8 or 4-bit values, instead of full-precision floats. It reduces memory bandwidth and can speed inference, but accuracy depends on how well the lower-precision representation preserves weights and activations.
- Post-Training Quantization (PTQ)Post-training quantization converts a trained model to lower precision after optimization is finished, usually using a calibration set to estimate activation ranges. It is easy and cheap to apply, but accuracy can drop more than with quantization-aware training at very low bit widths.
- Quantization-Aware Training (QAT)Quantization-aware training simulates low-precision arithmetic during training so the model learns weights that remain accurate after quantization. It usually outperforms post-training quantization at low bit widths, but it adds training cost and implementation complexity.
- Preference-Based AlignmentPreference-based alignment trains models from judgments such as ‘response A is better than response B’ instead of only from supervised targets. It is useful when desired behavior is easier for humans to compare than to specify as a single correct answer.
- Reinforcement Learning from Human Feedback (RLHF)RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize reward while staying close to a reference model. It improved helpfulness and instruction following, but it can also create reward hacking and training instability.
- Constitutional AIConstitutional AI aligns a model using an explicit list of principles that guide critique and revision, reducing the need for dense human feedback on every example. The constitution acts like a rule set for self-improvement, though the resulting behavior still depends on the chosen principles and training procedure.
- Direct Preference Optimization (DPO)DPO learns directly from preference pairs by making chosen responses more likely than rejected ones without running a separate RL loop. It can be derived from a KL-constrained reward-maximization view, which is why it is often presented as a simpler alternative to PPO-based RLHF.
- Vision Language Model (VLM)A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- Program-Aided Language ModelA program-aided language model uses the LLM to translate a problem into executable code, then lets an interpreter carry out the exact computation. This separates natural-language understanding from symbolic execution and often improves arithmetic or algorithmic reasoning over pure chain-of-thought.
- FlashAttentionFlashAttention is an exact attention algorithm that uses tiling and kernel fusion to minimize reads and writes between GPU HBM and on-chip SRAM. It preserves standard attention outputs while greatly reducing memory traffic, which yields large speed and memory gains on long sequences.
- Data ParallelismData parallelism replicates the model on multiple devices and splits each batch across them, synchronizing gradients after each step. It is the simplest way to scale training throughput, but every device still stores the full model unless sharding is added.
- Model ParallelismModel parallelism splits one model across multiple devices because it is too large or compute-heavy for a single device. The split can happen by layers, tensors, experts, or sequence chunks, trading memory savings for extra communication.
- Pipeline ParallelismPipeline parallelism partitions a model by layers across devices and sends microbatches through the partitions like an assembly line. It reduces per-device memory, but pipeline bubbles and stage imbalance can waste throughput if the schedule is poorly tuned.
- Tensor ParallelismTensor parallelism shards individual large matrix operations across devices, such as splitting weight matrices by rows or columns. It is effective for very large Transformers, but the frequent collectives mean fast interconnects are important.
- Context ParallelismContext parallelism distributes a long sequence across devices so context tokens and their attention-related work are sharded instead of fully replicated. It helps long-context training or inference scale beyond one device, but requires extra communication to preserve exact attention across chunks.
- Fully Sharded Data Parallel (FSDP)Fully Sharded Data Parallel shards model parameters, gradients, and optimizer states across data-parallel workers, gathering full parameters only when needed for computation. It is the PyTorch analogue of ZeRO-style training and makes much larger models fit without custom model-parallel code.
- Model ShardingModel sharding splits a model’s parameters across devices or storage tiers instead of keeping a full copy everywhere. It is a general systems technique used in tensor parallelism, FSDP, offloading, and large-model serving to reduce per-device memory requirements.
- Mixed Precision TrainingMixed-precision training performs most computation in lower precision such as FP16 or bfloat16 while keeping selected quantities in higher precision for stability. It reduces memory use and often increases throughput without much accuracy loss when implemented carefully.
- Long-Context PretrainingLong-context pretraining trains or continues training a model on examples with much longer sequences so it learns to use distant context instead of only fitting short windows. It is usually needed because simply changing positional scaling or the context limit does not teach robust long-range retrieval or reasoning.
- Token IDA token ID is the integer index assigned to a token after tokenization. Models do not operate on raw text directly; they look up embeddings from token IDs and later map output logits back to IDs during decoding.
- Vocabulary SizeVocabulary size is the number of distinct tokens a tokenizer can emit. A larger vocabulary shortens sequences but increases embedding and softmax size, while a smaller vocabulary produces longer sequences and more token fragmentation.
- Subword TokenizationSubword tokenization splits text into frequent pieces smaller than words but larger than individual characters. It handles rare words and open vocabularies well by composing unfamiliar words from known subword units.
- Special TokensSpecial tokens are reserved tokens with structural or control meaning, such as BOS, EOS, PAD, SEP, or mask tokens, rather than ordinary text content. They shape formatting, training objectives, and sometimes model behavior.
- Padding TokenA padding token is a dummy token added so sequences in a batch have equal length. It should be ignored by the loss and usually masked from attention so it does not behave like real context.
- BOS Token (Beginning of Sequence)A BOS token marks the beginning of a sequence and gives the model a consistent start symbol for conditioning generation or encoding. It can help define sequence boundaries and sometimes carries special training semantics.
- EOS Token (End of Sequence)An EOS token marks the end of a sequence and tells the model where generation should stop. During training it teaches sequence termination, and during inference it is one of the main stopping conditions.
- Nearest Neighbor SearchNearest neighbor search finds the stored vectors most similar to a query under a chosen distance or similarity metric. Exact search is simple but expensive at scale, so large systems often use approximate nearest neighbor indexes instead.
- Sparse RetrievalSparse retrieval represents queries and documents with sparse term-based features such as inverted indexes, TF-IDF, or BM25. It excels at exact keywords and rare identifiers, but is weaker than dense retrieval on paraphrases and semantic matching.
- Dense RetrievalDense retrieval represents queries and documents with learned dense embeddings and retrieves by vector similarity. It handles paraphrase and semantic matching better than sparse retrieval, but it can miss exact lexical constraints and usually relies on approximate nearest neighbor search.
- GroundingGrounding means tying a model’s answer to external evidence, inputs, or world state rather than letting it generate from unsupported priors alone. In RAG or tool-use systems, grounding is what makes outputs traceable to retrieved context or observations.
- FaithfulnessFaithfulness is whether a model’s output is supported by the provided input, source document, or chain of evidence. It differs from factuality because a summary can be perfectly faithful to a source that contains false claims.
- Temperature ScalingTemperature scaling calibrates a classifier by dividing logits by a learned scalar temperature before the softmax. It often improves probability calibration on a validation set without changing the model’s ranking of classes.
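
Temperature scaling is simple enough to sketch directly. A minimal plain-Python version (in practice the scalar `T` is fit on a held-out validation set, which is omitted here):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, T):
    # Divide all logits by the same scalar T before the softmax.
    # T > 1 flattens the distribution, T < 1 sharpens it; for any
    # T > 0 the argmax (and hence the class ranking) is unchanged.
    return softmax([z / T for z in logits])
```

Because every logit is divided by the same positive constant, accuracy is untouched; only the confidence of the probabilities changes.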
- Batch NormalizationBatch normalization normalizes activations using mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference because it relies on running statistics.
- Layer NormalizationLayer normalization normalizes activations across features within each example rather than across the batch. It works well for variable-length sequences and small batch sizes, which is why it is standard in Transformers.
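
The per-example normalization can be sketched in a few lines; this is a minimal scalar-loop version (real implementations are vectorized and operate on batched tensors):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of a single example,
    # then apply the learned scale (gamma) and shift (beta).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    denom = math.sqrt(var + eps)
    return [gamma[i] * (x[i] - mean) / denom + beta[i]
            for i in range(len(x))]
```

Note that nothing here depends on other examples in the batch, which is exactly why it behaves identically at training and inference time, unlike batch normalization.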
- Gradient ClippingGradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
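
Norm-based clipping rescales the whole gradient vector rather than truncating individual entries, so the update direction is preserved. A minimal sketch over a flat list of gradient values:

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Global L2 norm over all gradient entries.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    # Rescale so the clipped vector keeps its direction but has
    # norm exactly max_norm.
    scale = max_norm / norm
    return [g * scale for g in grads]
```
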
- Weight DecayWeight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but in adaptive optimizers the decoupled AdamW form is usually preferred.
- Adam OptimizerAdam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter’s update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
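
The moving-average and bias-correction mechanics are easy to see in a single-parameter sketch (the learning rate here is illustrative; production defaults differ):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    # Bias correction: m and v start at zero, so early averages are
    # scaled up to be unbiased estimates.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Note how the effective step size is roughly `lr` regardless of the raw gradient magnitude, because the update is normalized by the second-moment estimate.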
- RMSPropRMSProp uses an exponential moving average of squared gradients to normalize updates, preventing the denominator from growing without bound as in AdaGrad. It is useful for nonstationary problems and was a key precursor to Adam.
- Learning Rate ScheduleA learning-rate schedule changes the learning rate over training instead of keeping it constant. Schedules matter because they balance fast early progress with stable late optimization and often determine final performance as much as the base optimizer.
- WarmupWarmup starts training with a small learning rate and gradually increases it during the first steps. It reduces early instability, especially in Transformers where large updates before optimizer statistics settle can cause divergence.
- Cosine AnnealingCosine annealing decays the learning rate following a cosine curve from a high value to a low value, sometimes with restarts. It provides a smooth schedule that often works well in practice without needing many hand-tuned decay boundaries.
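
A common recipe combines linear warmup with cosine decay in one schedule function; a minimal sketch (step counts and rates here are illustrative):

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    # Linear warmup from 0 up to base_lr over the first warmup_steps.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Then cosine decay from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cos_factor = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cos_factor
```
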
- Label SmoothingLabel smoothing replaces hard one-hot targets with a mostly-correct probability distribution that assigns a small amount of mass to other classes. It regularizes overconfident classifiers and often improves generalization and calibration, though it can hurt when exact probabilities matter.
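
The target transformation itself is tiny. A sketch using one common convention (the smoothing mass `eps` is spread over the non-target classes; some implementations instead spread it over all classes, including the target):

```python
def smooth_labels(target, num_classes, eps=0.1):
    # The target class keeps 1 - eps of the probability mass;
    # the remaining eps is spread uniformly over the other classes.
    off = eps / (num_classes - 1)
    return [1.0 - eps if i == target else off
            for i in range(num_classes)]
```
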
- Tokenization PipelineA tokenization pipeline is the full process that turns raw text into model-ready inputs, including normalization, pre-tokenization, subword splitting, token-to-ID mapping, truncation, padding, and special-token insertion. Choices here directly affect sequence length, vocabulary coverage, and downstream behavior.
- GPU AccelerationGPU acceleration uses highly parallel graphics processors to speed the matrix and tensor operations that dominate modern ML workloads. It matters because deep learning is mostly throughput-bound linear algebra, which GPUs execute far more efficiently than general-purpose CPUs.
- CUDACUDA is NVIDIA’s parallel-computing platform and programming model for running general-purpose kernels on GPUs. In machine learning it is the software layer that makes GPU-accelerated training and inference practical, exposing massive parallelism, specialized libraries, and direct control over device memory.
- Inference OptimizationInference optimization is the set of techniques that reduce serving latency, memory use, and cost while preserving acceptable quality. Common methods include quantization, batching, KV-cache optimizations, kernel fusion, speculative decoding, and architecture choices that trade a little flexibility for much higher throughput.
- Instruction TuningInstruction tuning is supervised fine-tuning on instruction-response examples so a pretrained model learns to follow requests instead of merely continuing text. It improves task generality and usability, but it mainly changes behavior and format-following rather than adding much new world knowledge.
- Safety AlignmentSafety alignment is the process of making a model reliably avoid harmful, deceptive, or policy-violating behavior while remaining useful. In practice it combines data curation, supervised tuning, preference optimization or RLHF, classifiers, and adversarial evaluation, but it never guarantees perfect safety.
- Red-TeamingRed-teaming is adversarial evaluation in which people or automated systems deliberately try to break a model’s safeguards and expose failure modes. Its purpose is not to improve benchmark scores directly, but to find unsafe or brittle behavior before deployment.
- What is a jailbreak in the context of LLMs?In the context of LLMs, a jailbreak is a prompt or interaction pattern that bypasses the model’s safety training or policy enforcement and elicits behavior it was supposed to refuse. Jailbreaks matter because they reveal that aligned behavior can be a thin behavioral layer rather than a deep guarantee.
- Adversarial PromptingAdversarial prompting is the deliberate construction of inputs that push a model toward incorrect, unsafe, or unintended behavior. It includes jailbreaks, prompt injection, data exfiltration attempts, and other attacks that exploit weaknesses in instruction-following or context handling.
- Mechanistic InterpretabilityMechanistic interpretability treats a neural network as a system to be reverse-engineered into circuits, features, and algorithms. Its goal is not just to correlate neurons with concepts, but to identify the actual internal computations that produce behavior.
- Logit LensLogit Lens maps intermediate hidden states through the final unembedding matrix to inspect what tokens each layer already appears to favor. It is a convenient way to watch a Transformer’s computation unfold, though it is only approximate because earlier layers were not trained to be decoded directly.
- Sparse Autoencoder (Mechanistic Interpretability)In mechanistic interpretability, a sparse autoencoder is trained on model activations to decompose dense, superposed representations into a larger set of sparse features. This often makes latent structure more interpretable, because individual learned directions can line up with human-readable concepts or behaviors.
- Superposition (Neural Networks)Superposition is the phenomenon in which a network stores more features than it has obvious dimensions by packing them into overlapping directions. It explains why single neurons can look polysemantic and why sparse feature dictionaries are often more informative than neuron-by-neuron inspection.
- Data AugmentationData augmentation expands a training set with label-preserving transformations such as crops, paraphrases, or noise injection. It improves generalization by teaching the model which variations should not change the answer.
- Classification HeadA classification head is the final task-specific layer that maps learned representations to class logits or probabilities. In transfer learning it is often the only part trained from scratch, while the backbone provides reusable features.
- Softmax HeadA softmax head is the output projection plus softmax normalization that converts hidden representations into a probability distribution over classes or vocabulary items. In language models it is the layer that turns the final hidden state into next-token probabilities.
- Beam SearchBeam search is a decoding algorithm that keeps the top-scoring partial sequences at each step instead of only the single best one. It approximates high-probability generation better than greedy decoding, but it can still miss the global optimum and often reduces diversity.
- Repetition PenaltyA repetition penalty is a decoding heuristic that downweights tokens or phrases the model has already used, reducing loops and bland repetition. It improves generation quality when a model is prone to degeneracy, but too much penalty can make text unnatural or incoherent.
- Logit AdjustmentLogit adjustment means modifying logits to account for effects such as class imbalance, prior shift, or calibration goals before taking probabilities or losses. It changes the decision boundary in a simple way by shifting scores rather than changing the underlying representation.
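
One common form is post-hoc adjustment for class imbalance: subtract a scaled log-prior from each class logit. A minimal sketch (`tau` controls the adjustment strength; `tau=1` corresponds to dividing the predicted probabilities by the priors):

```python
import math

def adjust_logits(logits, priors, tau=1.0):
    # Subtract tau * log(prior) from each class logit so that
    # frequent classes lose their head start.
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]
```

With tied raw logits, the adjusted scores favor the rarer class, which is exactly the decision-boundary shift the entry describes.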
- Domain-Specific PretrainingDomain-specific pretraining continues or repeats pretraining on corpus data from a specialized domain such as law, medicine, or code. It improves vocabulary use, factual recall, and style in that domain, but it can also narrow the model or erode performance outside the target distribution.
- Distributed Computing (ML Training)Distributed computing in ML training spreads computation, memory, or both across many devices and often many machines. It is what makes modern large-model training possible through strategies such as data parallelism, model parallelism, sharding, and pipeline execution.
- Memory Optimization (ML Training)Memory optimization in ML training is the collection of techniques that reduce peak memory so larger models or batches fit on available hardware. Common examples are mixed precision, activation checkpointing, optimizer sharding, offloading, and more memory-efficient attention kernels.
- Role-Playing (LLMs)Role-playing in LLMs means conditioning a model to adopt a persona, voice, or behavioral frame during generation. It is useful for simulation and product design, but it also shows how easily high-level behavior can be steered by prompt context.
- Data MixtureA data mixture is the weighting and composition of different datasets or domains in a training run. It matters because capability, robustness, and bias often depend as much on what proportion of the data comes from each source as on the total token count.
- GRPO (Group Relative Policy Optimization)GRPO is a policy-optimization method that scores sampled responses relative to others in the same group, using those relative rewards to update the policy. Its appeal is that it can improve reasoning performance while avoiding some of the memory overhead of PPO-style critic training.
- Activation PatchingActivation patching is a causal analysis method where activations from one run are inserted into another to test which components matter for a given behavior. If patching a layer or head restores the behavior, that component is evidence for being on the relevant causal path.
- Log-Sum-Exp TrickThe log-sum-exp trick computes expressions like log(sum(exp(x_i))) stably by subtracting the maximum logit before exponentiation. It prevents overflow and underflow, so it is a standard numerical tool in softmax, cross-entropy, and probabilistic inference.
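
The trick is short enough to state exactly. Subtracting the maximum makes the largest exponent `exp(0) = 1`, so the sum never overflows, and the subtracted constant is added back outside the log:

```python
import math

def logsumexp(xs):
    # log(sum(exp(x_i))) = m + log(sum(exp(x_i - m))) for any m;
    # choosing m = max(xs) keeps every exponent <= 0.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naive evaluation of exp(1000.0) overflows a float64;
# the shifted version is exact up to rounding.
```
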
- Kolmogorov-Arnold NetworksKolmogorov-Arnold Networks replace fixed scalar weights on edges with learnable one-dimensional functions, so layers are built from sums of learned univariate transforms rather than simple affine maps. They are motivated by the Kolmogorov-Arnold representation theorem and are often discussed as a more interpretable alternative to MLPs, not a universal replacement.
- Double DescentDouble descent is the phenomenon in which test error first follows the classical U-shape with increasing model size, then improves again once the model passes the interpolation threshold. It matters because it shows that the old bias-variance story is incomplete in highly overparameterized regimes.
- GrokkingGrokking is a delayed generalization phenomenon in which a model first memorizes the training set and only much later snaps into a simple algorithm that generalizes well. It is interesting because the model already had enough capacity to fit the data, yet the more general solution emerged only after long training and regularization pressure.
- State Space Models / MambaState space models such as Mamba process sequences by evolving a learned hidden state through recurrence rather than full quadratic attention. Their main appeal is linear-time sequence processing with strong long-context efficiency, especially when selective state updates let the model decide what to remember.
- Chinchilla Scaling LawsChinchilla scaling laws showed that many large language models were undertrained for their size under fixed compute budgets. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, yielding better compute-optimal performance.
- Auxiliary Load-Balancing Loss (MoE)The auxiliary load-balancing loss in a Mixture-of-Experts model encourages the router to spread tokens more evenly across experts. Without it, routing often collapses onto a few experts, which wastes capacity and creates severe hot spots in both learning and systems performance.
- Muon OptimizerMuon is an optimizer designed especially for matrix-valued parameters that replaces the raw update direction with an orthogonalized one. The point is to respect matrix structure rather than treating every weight tensor as a flattened vector, with the goal of improving training efficiency relative to standard first-order optimizers.
- Pretraining Data DeduplicationPretraining data deduplication removes near-duplicate documents or passages from a training corpus. It improves per-token efficiency and reduces memorization and benchmark contamination, because repeatedly seeing the same text usually wastes compute more than it adds knowledge.
- Process Reward Models (PRM) vs Outcome Reward Models (ORM)Outcome reward models score only the final answer, while process reward models score the intermediate steps of a solution. PRMs provide denser supervision and better guidance for search and long-form reasoning, but they require more fine-grained labels and more complex evaluation.
- KTO (Kahneman–Tversky Optimization)KTO is a preference-optimization objective that learns from binary desirable-versus-undesirable labels instead of pairwise rankings. It uses a utility formulation inspired by prospect theory, making it a cheaper alternative when collecting full preference comparisons is too expensive.
- RLAIF (RL from AI Feedback)RLAIF replaces human preference labels with judgments produced by another AI model following a rubric. It scales alignment data collection much more cheaply than RLHF, but it also transfers the biases and blind spots of the judge model into the training signal.
- Reasoning Models (o1 / R1-style Long-CoT)Reasoning models in the o1 or R1 style are language models trained or prompted to spend extra inference compute on long multi-step reasoning before answering. Their key idea is that better reasoning can come not only from bigger models, but from better search, verification, and credit assignment at inference and post-training time.
- Process SupervisionProcess supervision trains a model on the quality of intermediate reasoning steps rather than only on whether the final answer is correct. It improves credit assignment for long solutions and makes verification more local, though collecting reliable step-level labels is expensive.
- Monte Carlo Tree Search for LLM ReasoningMonte Carlo Tree Search for LLM reasoning treats partial solution paths as tree nodes, expands candidate continuations, and uses rollouts or value estimates to decide where to search next. It is attractive because it turns one-shot generation into guided search over reasoning trajectories instead of committing immediately to a single chain of thought.
- Self-Refine / ReflexionTwo closely related inference-time techniques in which an LLM critiques and revises its own output over multiple rounds. Self-Refine uses the same model in three roles (generate → feedback → refine); Reflexion adds an episodic memory of past failures to guide future trajectories in agentic tasks.
- vLLM & Continuous BatchingvLLM is an LLM serving system built around PagedAttention and continuous batching. Instead of waiting for a batch to finish, it admits and schedules requests at each decoding step, which reduces padding waste and improves throughput for variable-length generations.
- GPTQ QuantizationA one-shot, layer-wise post-training quantization that pushes LLM weights to 3–4 bits while preserving generation quality. GPTQ reformulates quantization as a per-row error-minimisation problem solved greedily using the inverse Hessian of a small calibration set, achieving 3-bit LLaMA-65B with <1% perplexity loss.
- AWQ (Activation-aware Weight Quantization)AWQ is a post-training quantization method that preserves quality by protecting the weights attached to the largest activation channels before rounding. It targets low-bit LLM inference with small accuracy loss and is popular because it is simpler to deploy than Hessian-based methods such as GPTQ.
- Attention Sinks / StreamingLLMAttention sinks are the first few tokens in a causal Transformer that absorb disproportionate attention from later positions, even when they carry little semantic content. StreamingLLM exploits this by keeping sink tokens and a short recent window in the KV cache, enabling long streaming inference with bounded memory.
- Induction HeadsA two-head circuit in Transformer attention that finds a previous occurrence of the current token and copies the token that followed it — a hypothesized core mechanism of in-context learning. Anthropic showed induction heads form suddenly during training, coinciding with a sharp jump in ICL ability.
- Circuit AnalysisThe mechanistic-interpretability practice of identifying subgraphs of weights, residual-stream components, and attention heads that jointly implement a human-interpretable algorithm (indirect object identification, modular addition, greater-than). Circuit analysis produces falsifiable, causal accounts of what a network has learned.
- ROME / MEMIT Model EditingRank-one edits to MLP weights that inject a single fact (ROME) or thousands of facts (MEMIT) into a pretrained LLM without retraining. They exploit the observation that MLP blocks act as key–value memories, locate the causally relevant MLP layers via causal tracing (a form of activation patching), and solve a closed-form optimisation problem for the minimal-norm weight update.
- Core LLM Benchmarks (MMLU, HumanEval, GSM8K, MATH)MMLU tests broad academic knowledge, HumanEval tests code generation by unit tests, GSM8K tests grade-school math word problems, and MATH tests harder symbolic reasoning. Together they cover knowledge, code, and reasoning, but all can be gamed or saturated, so they are only a partial view of model quality.
- LMSYS Chatbot ArenaLMSYS Chatbot Arena is a crowdsourced pairwise-evaluation platform where users compare two anonymous models by chatting and voting. Its Elo-style ranking captures interactive preference better than a single benchmark, but it is noisy and sensitive to traffic mix and prompt selection.
- KV Cache Compression (H2O, SnapKV)Inference-time methods that shrink a long-context KV cache by evicting tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot pre-fill-time compression.
- Chunked PrefillA serving-time technique that breaks the long prefill of a prompt into small chunks and interleaves them with decode steps of other requests. By keeping GPU utilisation high during prefill and avoiding long tail latencies, chunked prefill dramatically improves throughput in mixed-batch LLM serving.
- Disaggregated Prefill/Decode ServingDisaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
- Paged vs Block KV CacheTwo allocation strategies for an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case length per request and wastes memory. Paged (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× higher batch-size at the cost of kernel-level bookkeeping.
- Tensor Cores & GEMM FundamentalsTensor cores are specialised matrix-multiply units in NVIDIA GPUs (introduced in Volta, 2017) that execute small mixed-precision matrix-multiply-accumulate (MMA) ops per clock. Peak DL throughput is set by tensor-core flops; general matrix multiplies (GEMMs) tiled to tensor-core shapes are how deep learning touches that ceiling.
- BF16 / FP8 / MXFP4 Number FormatsThese are low-precision number formats used to trade numerical precision for speed and memory efficiency in modern ML hardware. BF16 is the standard training workhorse, FP8 is increasingly used for faster training and inference, and 4-bit floating formats push efficiency further for aggressive inference optimization.
- DeepSpeed ZeRO-Infinity / OffloadingAn extension of ZeRO that offloads optimizer states, gradients, and parameters to CPU RAM and NVMe SSDs, enabling training of trillion-parameter models on modest GPU clusters. ZeRO-Infinity (Rajbhandari et al., 2021) uses bandwidth-aware partitioning and overlap to hide the offload latency.
- Ring Attention / Context Parallel for Long SequencesA distributed-attention algorithm that shards an \( n \)-token sequence across \( P \) devices and computes each attention output via a ring of key-value rotations. Ring Attention (Liu et al., 2023) enables context lengths of millions of tokens on multi-GPU clusters with near-linear scaling.
- Goodhart's Law and Specification GamingGoodhart's Law says that once a measure becomes a target, optimizing it can break its usefulness as a proxy. In AI safety and ML systems this appears as specification gaming: the model finds ways to maximize the metric without achieving the intended goal.
- Lottery Ticket HypothesisA dense randomly-initialised neural network contains subnetworks ("winning tickets") that — when trained in isolation with their original initialisation — match the full network's accuracy in the same number of steps. This Frankle–Carbin observation motivates one-shot and iterative magnitude pruning as search algorithms for sparse trainable subnetworks, reframing pruning as an initialisation search rather than a post-hoc compression.
- Loss Landscape: Flat vs Sharp MinimaFlat minima (low curvature / small Hessian eigenvalues) generalise better than sharp minima (high curvature), empirically and via PAC-Bayes bounds. SGD's noise, large batch sizes, and the Sharpness-Aware Minimisation (SAM) optimiser all interact with this: small-batch SGD prefers flat minima, large-batch SGD falls into sharper ones, and SAM explicitly penalises sharpness during training.
- Implicit Regularisation of SGDOver-parameterised networks trained by SGD generalise despite being able to fit pure noise — SGD's trajectory biases the solution toward specific minima. For linear models, gradient flow converges to the minimum-norm interpolator; for deep nets, SGD with small LR and moderate batch behaves like Bayesian inference with an implicit prior on flat minima. This implicit bias is why modern deep learning does not need explicit capacity control.
- Mixture of Depths (MoD)Mixture of Depths (Raposo et al., 2024) lets each token choose whether to go through the expensive self-attention + MLP stack at each layer, or to skip it via a residual. A small router predicts a saliency score; the top-\( k \) tokens per batch compute, the rest pass through. This per-token adaptive compute is the depth-axis counterpart of Mixture-of-Experts (width-axis) and substantially reduces FLOPs at matched quality.
- GPU Memory Hierarchy and the Roofline ModelThe GPU memory hierarchy ranges from slow, large global memory to faster on-chip caches, shared memory, and registers. The roofline model explains performance by comparing arithmetic intensity with hardware limits: low-intensity kernels are memory-bound, while high-intensity kernels are compute-bound.
- NCCL Collectives: All-Reduce, All-Gather, Reduce-ScatterNCCL (NVIDIA Collective Communications Library) implements ring and tree algorithms for GPU-to-GPU communication primitives — the building blocks of distributed deep learning. All-reduce sums gradients across workers; reduce-scatter + all-gather are the two halves of all-reduce, exploited by ZeRO and FSDP to shard parameters. Understanding these primitives is essential to reasoning about training scalability.
- Arithmetic IntensityArithmetic intensity is the number of floating-point operations performed per byte of data moved from memory. In the roofline model it determines whether a kernel is memory-bound or compute-bound, which is why matmuls are efficient and elementwise ops often are not.
- Triton GPU Kernel ProgrammingTriton (OpenAI, 2019) is a Python-embedded DSL for writing high-performance GPU kernels. It exposes tile-level primitives (block pointers, block-scoped arithmetic, automatic vectorisation) while hiding shared-memory management and thread-level scheduling. FlashAttention-2, Mamba kernels, many PyTorch 2 inductor-generated kernels, and most modern custom ops are written in Triton — fast enough to match hand-tuned CUTLASS with far less code.
- Natural Gradient & Fisher–Rao GeometryThe natural gradient \( \tilde\nabla_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L} \) preconditions the Euclidean gradient by the inverse Fisher information matrix, yielding steepest descent under the KL-divergence metric on the statistical manifold. It underlies K-FAC, Shampoo, TRPO's trust region, and the original motivation for reparameterisation-invariant optimisation.
- Sharpness-Aware Minimization (SAM)Minimise a loss whose value is worst-case over a small \( \rho \)-ball of weight perturbations: \( \min_\theta \max_{\|\varepsilon\| \le \rho} \mathcal{L}(\theta + \varepsilon) \). The ascent step \( \varepsilon^\star \approx \rho \, \nabla \mathcal{L}/\|\nabla \mathcal{L}\| \) biases training toward flat minima, improving generalisation across ViT, ResNet, and LLM finetuning.
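
The two-step structure of the update is clearest for a single scalar parameter, where the normalized ascent direction \( \nabla\mathcal{L}/\|\nabla\mathcal{L}\| \) reduces to the sign of the gradient. A minimal sketch (real SAM perturbs the full weight vector and needs two forward/backward passes per step):

```python
def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    # Ascent step: move to the approximate worst-case point in the
    # rho-ball (for a scalar, rho * g / |g| is just rho * sign(g)).
    g = grad_fn(theta)
    if g == 0:
        return theta
    eps = rho if g > 0 else -rho
    # Descent step: use the gradient measured at the perturbed point.
    return theta - lr * grad_fn(theta + eps)
```
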
- Lion OptimizerA momentum-only optimizer discovered by automated program search: updates are \( \theta \leftarrow \theta - \eta \, \text{sign}(\beta_1 m + (1-\beta_1) g) \). It uses only the sign of a momentum estimate — no second moment, so half the state of AdamW — and often matches or beats AdamW on large-scale training when run with smaller learning rates.
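
The update rule maps directly to code. A single-parameter sketch (the learning rate here is illustrative; Lion typically needs a smaller one than AdamW because every step has unit magnitude before scaling):

```python
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, lr=0.01, b1=0.9, b2=0.99, wd=0.0):
    # The update uses only the sign of an interpolation between the
    # momentum buffer and the fresh gradient; there is no second moment.
    update = sign(b1 * m + (1 - b1) * grad)
    theta = theta - lr * (update + wd * theta)
    # The momentum buffer itself is tracked with a second coefficient.
    m = b2 * m + (1 - b2) * grad
    return theta, m
```
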
- Shampoo & K-FAC PreconditionersShampoo and K-FAC are second-order-inspired optimizers that precondition gradients with matrix or blockwise curvature information instead of only per-parameter learning rates. They aim to converge in fewer steps than Adam or SGD, especially in large-batch training where curvature estimates are more stable.
- AdaFactor OptimizerA memory-efficient Adam variant. Replaces the full second-moment matrix with row and column running averages (a rank-1 factorisation): memory is \( O(m + n) \) instead of \( O(mn) \). Used by T5, Meta's BlenderBot, and many large-scale models to free memory for bigger batches and longer contexts.
- Gradient Accumulation & Micro-BatchingSplit a large effective batch into \( k \) micro-batches; accumulate their gradients in a buffer, then step once. Decouples statistical batch size from hardware batch size, enabling \( N\cdot k \) effective batch without a \( k\times \) memory blow-up. Essential for training large models on any GPU.
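
The key identity is that, for equal-sized micro-batches, the average of per-micro-batch gradients equals the full-batch gradient. A sketch for a one-parameter least-squares loss (the model and loss here are illustrative):

```python
def grad_mse(w, batch):
    # Gradient of mean((w*x - y)^2) over one (micro-)batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    # Accumulate per-micro-batch gradients, then average; this is the
    # single "effective batch" gradient used for one optimizer step.
    grads = [grad_mse(w, mb) for mb in micro_batches]
    return sum(grads) / len(grads)
```

In a framework, the accumulation buffer is the parameter `.grad` field itself: backward passes add into it for `k` micro-batches before a single optimizer step and zeroing.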
- Data Curation & Quality Filters (FineWeb, Dolma)Modern pretraining pipelines filter terabytes of web data through language ID, heuristic rules (repetition, punctuation ratios), classifier-based quality scoring, and toxicity / PII removal. The FineWeb and Dolma recipes document which filters mattered — often delivering per-token quality gains equivalent to 2–3× scale-up.
- MinHash & LSH for Large-Scale DeduplicationMinHash approximates Jaccard similarity between sets, and locality-sensitive hashing uses that approximation to quickly find near-duplicates. In ML data pipelines this is used to deduplicate documents or examples at very large scale.
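
The core estimator fits in a few lines. A sketch using salted MD5 hashes to simulate random permutations (production systems use faster hashes and band the signature for LSH lookup, which is omitted here):

```python
import hashlib

def minhash_signature(tokens, num_hashes=32):
    # Each salted hash simulates one random permutation of the token
    # universe; the minimum over the set fills one signature slot.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The probability that two sets share a slot minimum equals their
    # Jaccard similarity, so matching slots estimate it.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents are flagged as near-duplicates when the estimated Jaccard similarity of their shingle sets exceeds a threshold.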
- Continual Pretraining & Mid-TrainingContinue pretraining an existing base model on a domain or task-focused corpus (code, math, a new language) before final post-training. Achieves domain gains that would cost 10× more to obtain by fine-tuning alone. Sits between pretraining and SFT in modern recipes.
- Long-Context Data Recipes (RULER, Needle Variants)Extending effective context beyond 128k requires (a) RoPE-scaling or position-interpolation to keep positional encodings sane, (b) a continued-pretraining dataset with real long documents and synthetic stitched tasks, and (c) evaluation beyond simple needle-in-a-haystack — RULER adds multi-needle, multi-hop, and aggregation subtasks that expose superficial-match shortcuts.
- Tokenizer Training Dynamics & Vocab SizingTokenizer design trades vocabulary size against sequence length and changes both compute cost and what patterns the model can represent cleanly. BPE, unigram, and byte-level schemes make different compromises, especially for code, multilingual text, and rare domain terms.
- FlashAttention-2 and FlashAttention-3FlashAttention-2 and FlashAttention-3 are follow-on attention kernels that keep exact attention outputs while running much faster through better tiling, parallelism, and data movement. FA-2 improves work partitioning on modern GPUs, while FA-3 adds Hopper-specific asynchronous pipelines and low-precision support.
- Medusa & EAGLE Speculative-Decoding HeadsSpeculative decoding with learned draft heads instead of a separate draft model. Medusa adds \( K \) lightweight heads to the base model that predict future tokens; the base model verifies the resulting tree of candidate continuations in one forward pass. EAGLE models the residual stream directly and achieves 3–4× speedup with a tiny draft network.
- Lookahead DecodingExact speculative-like decoding without any draft model: maintain a running n-gram 'Jacobi window' that proposes multiple tokens ahead in parallel; verify in one pass. Lossless — outputs match greedy decoding exactly — and requires no extra training. Trades increased per-step compute for fewer sequential steps.
- Radix / Prefix-Cache Attention (SGLang)Share the KV cache across requests that start with a common prompt prefix. Store prefix trees keyed by token sequence; on a new request, find the longest matching prefix in the cache and reuse it. Cuts prefill latency and memory use for chat applications with shared system prompts or few-shot contexts.
- Quantized KV Cache (int4 / int8 / KIVI)Store the KV cache at lower precision — int8 or int4 — instead of fp16. Halves or quarters the memory footprint of long contexts at negligible quality cost. Different quantisation per key / value (K usually int8, V int4 via grouping) and per-head asymmetric scales are the main tricks.
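
The basic asymmetric round-trip looks like this. A per-tensor sketch (real KV-cache schemes apply it per head or per group with separate scales, as the entry notes):

```python
def quantize_uint8(values):
    # Asymmetric quantization: map the range [lo, hi] onto the
    # 256 available integer levels.
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_uint8(q, scale, lo):
    # Invert the mapping; error per element is at most scale / 2.
    return [x * scale + lo for x in q]
```
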
- Continuous vs Static BatchingStatic batching groups requests before a forward pass and runs them to completion together — tail latency is set by the slowest request. Continuous batching (Orca, vLLM) evicts finished requests mid-step and admits new ones each iteration, keeping GPU utilisation high and tail latency bounded. Default in production LLM serving.
- Mean Field Theory of Neural NetworksMean-field theory studies very wide neural networks by tracking distributions of parameters or activations instead of individual weights. It yields clean scaling limits for training dynamics and feature learning, and helps distinguish true feature-learning regimes from the lazy-training NTK regime.
- Information Bottleneck TheoryInformation Bottleneck theory studies representations that preserve information about the target while compressing information about the input, often through a trade-off like \( I(Z;Y) - \beta I(Z;X) \). It is a useful lens on representation learning and generalization, though its direct explanatory power for deep networks remains debated.
- Stability and GeneralizationAn algorithm is uniformly \( \beta \)-stable if replacing one training point changes its output's loss by at most \( \beta \). Bousquet & Elisseeff (2002) proved that \( \beta \)-stability bounds the generalization gap by \( O(\beta + 1/\sqrt{n}) \); Hardt, Recht & Singer (2016) showed SGD on smooth losses is \( O(T/n) \)-stable, giving the first algorithm-dependent generalization bound for deep learning that grows with training time.
- Algorithmic Alignment TheoryA neural architecture generalises better on a reasoning task when its computational structure aligns with the algorithm that solves the task. Xu et al. (2020) formalise sample complexity in terms of the number of network modules that must be learned and the per-module learnability, predicting that GNNs (multi-step message passing) align with dynamic-programming algorithms while plain MLPs do not.
- Spectral Bias of Neural NetworksSpectral bias is the tendency of gradient-trained neural networks to learn low-frequency or smooth components of a target function before high-frequency ones. This helps explain why neural nets often fit coarse structure early and fine detail later.
- Neural CollapseAt the terminal phase of training (TPT) — long after zero training error — the last-layer features and classifier weights of a deep classifier converge to a highly symmetric configuration: per-class feature means form a Simplex Equiangular Tight Frame (ETF), within-class variability collapses to zero, classifier weights align with the class means, and prediction reduces to nearest-class-centre. Papyan, Han & Donoho (2020) established this as a robust empirical phenomenon across architectures and datasets.
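The target geometry is easy to verify numerically: a simplex ETF for \( C \) classes has unit-norm class means with all pairwise cosines equal to \( -1/(C-1) \). A small NumPy check (the construction is standard; the variable names are ours):

```python
# Construct a Simplex Equiangular Tight Frame for C classes and check the
# neural-collapse geometry: equal norms, equal pairwise angles.
import numpy as np

C = 5
# Columns of M are the idealised class means.
M = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

norms = np.linalg.norm(M, axis=0)        # all 1.0
G = (M / norms).T @ (M / norms)          # cosine-similarity matrix
off_diag = G[~np.eye(C, dtype=bool)]     # all -1/(C-1) = -0.25 for C = 5
print(norms, off_diag)
```

Neural collapse says the empirical class-mean features of a well-trained classifier converge to exactly this configuration (up to rotation and scaling).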
- Mode Connectivity in Loss LandscapesMode connectivity is the empirical finding that independently trained solutions can often be connected by a low-loss path in parameter space. This suggests that many minima in deep learning are not isolated basins but parts of wider connected regions.
- Gradient Noise ScaleA scalar diagnostic that estimates the largest useful batch size by comparing the variance of per-example gradients to the squared norm of the mean gradient: \( B_{\text{noise}} = \operatorname{tr}(\Sigma)/\|g\|^2 \). McCandlish et al. (2018) argue that returns to scaling batch size diminish sharply once \( B \gg B_{\text{noise}} \), giving a principled way to choose batch size during large-scale training.
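The estimator can be sketched directly from per-example gradients (the synthetic gradients and noise level below are illustrative; the paper estimates the same quantity from gradients at two batch sizes to avoid materialising per-example gradients):

```python
# Estimate the gradient noise scale B_noise = tr(Sigma) / ||g||^2 from
# per-example gradients on a synthetic problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
true_g = rng.standard_normal(d)
# Per-example gradients = true gradient + independent noise.
per_example_grads = true_g + 2.0 * rng.standard_normal((n, d))

g = per_example_grads.mean(axis=0)                         # mean gradient
trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # tr(Sigma)
B_noise = trace_sigma / (g @ g)
print(B_noise)   # batch sizes far above this give diminishing returns
```

Intuitively, once the batch averages away most of the gradient noise, adding more examples per step stops buying optimisation progress.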
- Adaptive Gradient Clipping (AGC)A per-parameter clipping rule introduced by Brock et al. (2021) that bounds each weight's update by a fraction of the weight's own norm: clip \( g \) so \( \|g\|/\|w\| \le \lambda \). Unlike global-norm clipping, AGC scales naturally with parameter magnitude and made it possible to train Normalizer-Free Networks (NFNets) without batch normalisation while matching its training stability.
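The rule itself is one line of arithmetic. A sketch (the paper applies it per output unit with an \( \epsilon \) floor on the weight norm; for brevity this version clips a whole parameter's gradient):

```python
# Sketch of Adaptive Gradient Clipping: rescale g whenever the update ratio
# ||g|| / ||w|| exceeds lambda, so updates stay proportional to weight size.
import numpy as np

def agc(g, w, lam=0.01, eps=1e-3):
    w_norm = max(np.linalg.norm(w), eps)   # eps guards near-zero weights
    g_norm = np.linalg.norm(g)
    if g_norm > lam * w_norm:
        g = g * (lam * w_norm / g_norm)
    return g

w = np.ones(4)                      # ||w|| = 2
g = np.full(4, 10.0)                # ||g|| = 20, far above lam * ||w||
g_clipped = agc(g, w, lam=0.1)
print(np.linalg.norm(g_clipped))    # clipped to 0.1 * 2 = 0.2
```

Because the threshold scales with \( \|w\| \), large layers tolerate large gradients while small ones are protected, unlike a single global norm threshold.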
- Self-Paced LearningA curriculum-learning variant where the model itself decides which examples are 'easy enough' to train on at the current step, by minimising a joint objective \( \sum_i v_i \ell_i - \lambda \sum_i v_i \) over both parameters \( \theta \) and per-example weights \( v_i \in \{0,1\} \) (or \( [0,1] \)). Kumar, Packer & Koller (2010) introduced it as a non-convex EM-style alternative to handcrafted curricula; \( \lambda \) is annealed from low (only easy examples) to high (all examples).
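For fixed \( \theta \), the inner minimisation over \( v \) has a closed form: \( v_i = 1 \) exactly when \( \ell_i < \lambda \), since including example \( i \) changes the objective by \( \ell_i - \lambda \). A tiny sketch:

```python
# Closed-form inner step of self-paced learning: select example i iff its
# current loss is below lambda. Annealing lambda upward admits harder
# examples over the course of training.
import numpy as np

def spl_weights(losses, lam):
    return (losses < lam).astype(float)

losses = np.array([0.1, 0.5, 1.2, 3.0])
print(spl_weights(losses, lam=0.6))   # [1. 1. 0. 0.] : only 'easy' examples
print(spl_weights(losses, lam=5.0))   # [1. 1. 1. 1.] : all examples admitted
```

The outer step then updates \( \theta \) on the selected subset, alternating EM-style.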
- Loss Landscape VisualizationMethods for visualising high-dimensional loss surfaces by projecting parameters onto 1-D or 2-D subspaces. Goodfellow's linear interpolation (2014) plots loss along the line between two solutions; Li et al.'s filter normalisation (2018) plots loss in a 2-D plane spanned by random Gaussian directions normalised per-filter. The latter reveals that residual connections smooth the landscape and that flat minima correspond to wide bowls in the visualisation.
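The linear-interpolation probe amounts to evaluating the loss along \( \theta(\alpha) = (1-\alpha)\theta_0 + \alpha\theta_1 \). A toy sketch with an analytic loss that has two known minima (the loss function here is invented purely to make a barrier visible):

```python
# Goodfellow-style 1-D interpolation between two solutions. With a real
# network, loss(theta) would run a forward pass over a held-out batch.
import numpy as np

def loss(theta):
    # Toy loss with minima at theta = +1 and theta = -1 (elementwise).
    return np.sum((theta - 1.0) ** 2) * np.sum((theta + 1.0) ** 2) / 4

theta0 = np.full(3, 1.0)    # one "trained" solution
theta1 = np.full(3, -1.0)   # another "trained" solution
alphas = np.linspace(0, 1, 5)
curve = [loss((1 - a) * theta0 + a * theta1) for a in alphas]
print(curve)   # zero at both ends, barrier in the middle
```

Filter normalisation extends this to 2-D planes and rescales the random directions so that plots are comparable across layers with different weight scales.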
- Gradient Surgery (PCGrad) for Multi-Task LearningWhen two task gradients in multi-task learning point in conflicting directions (negative cosine), they partially cancel each other and slow learning. Yu et al.'s PCGrad (2020) projects each task gradient onto the normal plane of any conflicting task's gradient before summing, removing the destructive component. This 'gradient surgery' restores monotone progress on both tasks at modest cost.
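The projection step for a conflicting pair is a one-line vector operation (full PCGrad iterates this over randomly ordered task pairs; this two-task sketch shows just one projection):

```python
# PCGrad's core step: if g1 conflicts with g2 (negative dot product),
# remove g1's component along g2 before the gradients are summed.
import numpy as np

def project_conflict(g1, g2):
    if g1 @ g2 < 0:
        g1 = g1 - (g1 @ g2) / (g2 @ g2) * g2
    return g1

g1 = np.array([1.0, 1.0])
g2 = np.array([-1.0, 0.0])           # conflicts with g1
g1_proj = project_conflict(g1, g2)
print(g1_proj)                       # [0. 1.] : destructive component removed
print(g1_proj @ g2)                  # 0.0 : no longer conflicting
```

After projection the surviving component of each task gradient no longer opposes the other task, so the summed update makes progress on both.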
- Compressed Sparse Attention (CSA)Compressed Sparse Attention (CSA) is a long-context attention scheme that first compresses the KV cache into block summaries and then performs sparse attention only over the top-k relevant compressed blocks. An added sliding-window branch preserves exact local dependencies, so CSA cuts both KV memory and long-context attention compute without collapsing into a purely local window.
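The block-selection step can be sketched as mean-pooled key summaries scored against the query, with the local window always retained (shapes, pooling, and scoring are illustrative simplifications of the compression stage):

```python
# Sketch of block selection for compressed sparse attention: summarise the
# keys per block, score summaries against the query, keep the top-k blocks
# plus a sliding window of recent blocks for exact local attention.
import numpy as np

def select_blocks(q, K, block=16, topk=2, window=1):
    T, d = K.shape
    n_blocks = T // block
    summaries = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    scores = summaries @ q                          # relevance of each block
    chosen = set(np.argsort(scores)[-topk:].tolist())
    chosen |= set(range(max(0, n_blocks - window), n_blocks))  # local window
    return sorted(chosen)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((64, 8))
print(select_blocks(q, K))   # block indices that receive full attention
```

Full attention is then computed only over tokens inside the selected blocks, so compute scales with the number of kept blocks rather than total context length.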
- Distributed Training (Data & Model Parallelism)Distributed training scales learning across many devices by splitting either the data, the model, or both. Data parallelism is simplest when the model fits on each device, while model parallel approaches such as tensor and pipeline parallelism are needed when parameters or activations are too large for one accelerator.
- Efficient Inference (Distillation, Pruning)Efficient inference reduces latency, memory, or serving cost without retraining a model from scratch, often by distilling a smaller student or pruning unimportant weights and structures. Distillation transfers behavior; pruning removes computation, and the two are often combined with quantization for deployment-grade compression.
- Serving LLMs at ScaleServing LLMs at scale is a systems problem of jointly optimizing prompt prefill throughput, token-by-token decode latency, KV-cache memory, batching policy, and fleet utilization. Modern serving stacks rely on continuous batching, prefix caching, PagedAttention, speculative decoding, and sometimes prefill/decode disaggregation to keep both tail latency and GPU cost under control.
- Dataset Versioning & LineageDataset versioning and lineage track exactly which raw data, labels, transformations, and filters produced a training or evaluation set. They matter because reproducibility, compliance, rollback, and debugging all depend on being able to answer "which data built this model?" with more precision than a folder name or timestamp.
- Feature StoresFeature stores are systems for defining, computing, and serving reusable machine-learning features consistently across training and production. Their core promise is point-in-time correctness and train/serve consistency: the feature a model saw offline should match the feature served online for the same entity and timestamp.
- ML System Monitoring & Drift DetectionML system monitoring tracks whether a deployed model is still receiving the kind of data it was built for and whether its business and technical behavior remain acceptable. Drift detection is one part of that job: teams also monitor latency, calibration, feature freshness, label delay, feedback loops, and downstream outcomes, because data drift alone does not tell the whole production story.
- RLHF as KL-Regularized Policy OptimizationA deeper theoretical view of RLHF treats post-training as optimizing a policy against a learned reward while regularizing toward a reference model with a KL penalty. This viewpoint explains why PPO-RLHF, reward-model training, and even DPO-style objectives are closely related: they are different ways of solving or approximating the same regularized preference-optimization problem.