Tag: transformer
188 topics
- Layer Dropping and Progressive Pruning (TrimLLM): Layer dropping and progressive pruning reduce inference cost by cutting transformer depth rather than shrinking every matrix. TrimLLM does this progressively for domain-specialized LLMs, exploiting the empirical fact that not all layers are equally important in a target domain and aiming to retain in-domain accuracy while reducing latency.
- Test-Time Compute Scaling: Test-time compute scaling improves a model by spending extra computation at inference time, for example through search, verification, reranking, or adaptive refinement, instead of only scaling pretraining. It is most useful on prompts where the base model already has some chance of success, because additional compute can then amplify that success more efficiently than a much larger one-shot model.
- GPT-3 & Few-Shot In-Context Learning: GPT-3 showed that a 175B-parameter autoregressive Transformer can perform many tasks from natural-language instructions and a few demonstrations in the prompt, without gradient updates or task-specific fine-tuning. That result made in-context learning a central paradigm and showed that scale alone could unlock strong few-shot behavior.
- Mechanistic OOCR Steering Vectors: Mechanistic OOCR steering vectors are a proposed explanation for some out-of-context reasoning results: fine-tuning can act like adding an approximately constant steering direction to the residual stream, rather than learning a deeply conditional new algorithm. That helps explain why a tuned behavior can generalize far beyond the fine-tuning data and why injecting or subtracting the vector can often reproduce or remove it.
- Critical Representation Fine-Tuning (CRFT): Critical Representation Fine-Tuning (CRFT) is a PEFT method that improves reasoning by editing a small set of causally important hidden states instead of updating model weights broadly. It identifies critical representations through information-flow analysis and learns low-rank interventions on those states while keeping the base model frozen.
- ZeRO (Zero Redundancy Optimizer): ZeRO partitions optimizer states, gradients, and eventually parameters across data-parallel workers so each GPU no longer stores a full copy of the training state. This cuts memory dramatically and makes very large-model training feasible without requiring full model-parallel architectures.
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- GPT-1 (Generative Pre-Training): GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- Sparsely-Gated Mixture of Experts (MoE): A sparsely-gated Mixture of Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic.
- Luong Attention (Global and Local): Luong attention is a sequence-to-sequence attention mechanism that scores decoder states against encoder states using multiplicative forms such as dot or bilinear attention. It distinguishes global attention over all source positions from local attention over a predicted window, helping make neural machine translation more scalable.
- Xavier/Glorot Initialization: Xavier or Glorot initialization chooses weight variance from fan-in and fan-out so activations and gradients stay roughly stable across deep layers. It is well suited to symmetric activations such as tanh, while ReLU networks usually prefer He initialization.
- Weight Tying: Weight tying uses the same matrix for token embeddings and the output softmax projection, typically by setting the output weights to the transpose of the input embedding table. This cuts parameters and often improves language modeling by forcing input and output token representations to share geometry.
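  A minimal NumPy sketch of weight tying with toy sizes (`E`, `embed`, and `output_logits` are illustrative names, not library APIs):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 10, 4                              # toy vocabulary size and model dimension
  E = rng.standard_normal((V, d)) * 0.02    # one matrix shared by both ends

  def embed(token_ids):
      return E[token_ids]                   # input side: embedding lookup

  def output_logits(hidden):
      return hidden @ E.T                   # output side: E transposed as the softmax projection

  h = embed(np.array([3, 7])).mean(axis=0)  # stand-in for a final hidden state
  print(output_logits(h).shape)             # (10,): one logit per vocabulary item
  ```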
- Gradient Checkpointing (Activation Recomputation): Gradient checkpointing saves memory by storing only selected activations during the forward pass and recomputing the missing ones during backpropagation. The trade-off is extra compute for lower peak memory, which is why it is widely used to train large Transformers that would otherwise not fit in GPU memory.
- PagedAttention: PagedAttention stores the KV cache in fixed-size non-contiguous blocks, like virtual-memory pages, instead of requiring one contiguous allocation per sequence. This largely removes fragmentation, enables prompt-prefix sharing, and is a key reason vLLM can serve many more concurrent requests.
- Speculative Decoding: Speculative decoding speeds up autoregressive generation by letting a small draft model propose several tokens and then having the large target model verify them in parallel. With the rejection-sampling correction from the original algorithm, the output distribution remains exactly the same as sampling from the target model alone.
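  A sketch of the accept-or-resample rule for one drafted token, assuming `p_target` and `p_draft` are the two models' next-token distributions over the same vocabulary and `x` was sampled from the draft model (so `p_draft[x] > 0`):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def accept_or_resample(p_target, p_draft, x):
      # Accept the drafted token with probability min(1, p_target[x] / p_draft[x]).
      if rng.random() < min(1.0, p_target[x] / p_draft[x]):
          return x
      # Otherwise resample from the normalized leftover mass max(0, p - q),
      # which makes the overall output distribution exactly p_target.
      residual = np.maximum(p_target - p_draft, 0.0)
      residual /= residual.sum()
      return int(rng.choice(len(residual), p=residual))

  p_target = np.array([0.6, 0.3, 0.1])
  p_draft = np.array([0.2, 0.7, 0.1])
  print(accept_or_resample(p_target, p_draft, x=1))
  ```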
- Next-Token Prediction Objective (Causal Language Modeling): Next-token prediction trains a causal language model to assign high probability to each token given all previous tokens. Maximizing this likelihood over large text corpora teaches the model syntax, facts, and reusable patterns that later support prompting and generation.
- AdamW Optimizer: AdamW is Adam with decoupled weight decay: parameter shrinkage is applied directly to the weights instead of being mixed into the adaptive gradient update. This preserves the intended regularization effect and is why AdamW became the default optimizer for many Transformer models.
- SwiGLU Activation Function: SwiGLU is a gated feed-forward activation that multiplies one linear projection by a Swish-activated gate from another projection. It usually performs better than standard ReLU-style MLP blocks at similar scale, which is why many modern LLMs use it in their feed-forward layers.
- Pre-Norm vs. Post-Norm Architecture: Pre-Norm vs. Post-Norm is the choice of whether layer normalization is applied before or after each residual sublayer in a Transformer block. Pre-Norm usually trains deeper stacks more stably by preserving gradient flow through the residual path, while Post-Norm was the original design and can be less stable at scale.
- Byte Pair Encoding (BPE): Byte Pair Encoding is a subword tokenization method that repeatedly merges the most frequent adjacent symbols in a corpus. It builds a vocabulary between characters and whole words, which handles rare words better than word-level tokenization while keeping sequence lengths manageable.
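  A toy version of the BPE merge loop, run for three merges on a three-word corpus (`most_frequent_pair` and `merge` are illustrative helpers, not a library API):

  ```python
  from collections import Counter

  def most_frequent_pair(corpus):
      # Count adjacent symbol pairs across all words.
      pairs = Counter()
      for word in corpus:
          for a, b in zip(word, word[1:]):
              pairs[(a, b)] += 1
      return pairs.most_common(1)[0][0]

  def merge(corpus, pair):
      # Replace every occurrence of `pair` with one concatenated symbol.
      merged = []
      for word in corpus:
          out, i = [], 0
          while i < len(word):
              if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                  out.append(word[i] + word[i + 1])
                  i += 2
              else:
                  out.append(word[i])
                  i += 1
          merged.append(out)
      return merged

  corpus = [list("lower"), list("lowest"), list("low")]
  for _ in range(3):
      corpus = merge(corpus, most_frequent_pair(corpus))
  print(corpus)  # [['lowe', 'r'], ['lowe', 's', 't'], ['low']]
  ```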
- Softmax Temperature: Softmax temperature rescales logits before softmax to control randomness in the output distribution. Lower temperature makes probabilities sharper and decoding more deterministic, while higher temperature flattens the distribution and increases diversity.
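  A minimal sketch of temperature scaling:

  ```python
  import numpy as np

  def softmax_with_temperature(logits, T=1.0):
      z = np.asarray(logits, dtype=float) / T
      z -= z.max()               # subtract the max for numerical stability
      e = np.exp(z)
      return e / e.sum()

  logits = [2.0, 1.0, 0.1]
  print(softmax_with_temperature(logits, T=0.5))  # sharper, near-deterministic
  print(softmax_with_temperature(logits, T=2.0))  # flatter, more diverse
  ```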
- Key-Value (KV) Caching: Key-value caching stores the attention keys and values from earlier tokens during autoregressive decoding so they do not need to be recomputed at every step. It speeds up generation dramatically, but the cache grows with sequence length and turns inference into a memory-management problem.
- Rotary Positional Embedding (RoPE): Rotary Positional Embedding encodes position by rotating query and key vectors with token-index-dependent angles before attention is computed. Because the resulting dot products depend on relative offsets, RoPE gives Transformers a simple and widely used way to represent order.
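  A NumPy sketch of the rotation that also checks the relative-offset property (a simplified pairing of features; real implementations differ in layout details):

  ```python
  import numpy as np

  def rope(x, pos, base=10000.0):
      d = x.shape[0]                                  # even feature dimension
      freqs = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per feature pair
      theta = pos * freqs                             # angles grow with token index
      x1, x2 = x[0::2], x[1::2]
      out = np.empty_like(x)
      out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
      out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
      return out

  q = np.ones(8)
  # Dot products depend only on the relative offset between positions:
  print(np.dot(rope(q, 3), rope(q, 5)))    # offset 2
  print(np.dot(rope(q, 10), rope(q, 12)))  # same offset, same value
  ```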
- Sinusoidal Positional Encoding: Sinusoidal positional encoding adds fixed sine and cosine patterns of different frequencies to token embeddings so the model can infer token order. The encoding is deterministic and smooth across positions, which let the original Transformer represent position without learning a separate table.
- Causal (Masked) Self-Attention: Causal masked self-attention is self-attention with a mask that prevents each position from attending to future tokens. Applying the mask before softmax enforces autoregressive order, so the model can predict the next token without seeing the answer in advance.
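  A single-head NumPy sketch with scaled scores and a causal mask applied before softmax:

  ```python
  import numpy as np

  def causal_self_attention(Q, K, V):
      T, d_k = Q.shape
      scores = Q @ K.T / np.sqrt(d_k)            # scaled attention scores
      future = np.triu(np.ones((T, T), dtype=bool), k=1)
      scores = np.where(future, -1e9, scores)    # block future positions pre-softmax
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
      return weights @ V                         # weighted sum of value vectors

  rng = np.random.default_rng(0)
  x = rng.standard_normal((5, 8))                # 5 tokens, d_k = 8
  print(causal_self_attention(x, x, x).shape)    # (5, 8)
  ```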
- Dropout: Dropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer.
- Linear function: A linear function satisfies additivity and homogeneity, so it can be written as a matrix map with no bias term. In machine learning people often use 'linear' loosely for affine maps, but mathematically the distinction matters because adding a bias breaks true linearity.
- Affine transformation: An affine transformation is a linear map followed by a translation, so it has weights and a bias. Dense neural network layers are affine rather than strictly linear, because the bias lets the model shift activations and decision boundaries.
- ReLU: ReLU outputs the positive part of its input and zero otherwise. It became the default activation in many deep networks because it is simple, cheap, and far less prone to saturation than sigmoid or tanh, though units can still die if they stay on the zero side.
- Softmax: Softmax turns a vector of logits into a probability distribution by exponentiating and normalizing them so the components sum to one. It is commonly used for multiclass prediction because it converts arbitrary scores into class probabilities while preserving their ranking.
- Multilayer perceptron (MLP): A multilayer perceptron is a feedforward neural network made of stacked fully connected layers and nonlinear activations. It is the canonical dense architecture for tabular function approximation and the feed-forward subnetwork inside many Transformer blocks.
- Embedding layer: An embedding layer maps discrete IDs such as words, subwords, or items to learned dense vectors. It is essential whenever symbolic inputs must be represented in a continuous space that gradient-based models can manipulate.
- Embedding vector: An embedding vector is the dense continuous representation assigned to a discrete token, item, or entity by an embedding table or model. Its meaning comes from geometry: similar entities tend to occupy nearby directions or neighborhoods in the learned space.
- Word embedding: A word embedding is a dense vector representation of a word learned from distributional context rather than hand-coded features. Its purpose is to place semantically or syntactically related words near one another in vector space so downstream models can generalize across vocabulary items.
- Semantic similarity: Semantic similarity is the degree to which two words, sentences, or documents share meaning rather than just surface form. In machine learning it is often estimated with embeddings and cosine similarity, which turns meaning comparison into a geometric problem.
- Cosine similarity: Cosine similarity measures the angle between two vectors: \( \cos \theta = x \cdot y / (\|x\| \|y\|) \). It ignores magnitude and compares direction, which is why it is the default similarity metric for embeddings in retrieval, clustering, and semantic search.
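  The formula in code:

  ```python
  import numpy as np

  def cosine_similarity(x, y, eps=1e-9):
      # cos(theta) = x . y / (||x|| * ||y||); eps guards against zero vectors
      return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

  a = np.array([1.0, 2.0, 3.0])
  print(cosine_similarity(a, 10 * a))                      # 1.0: magnitude ignored
  print(cosine_similarity(a, np.array([-2.0, 1.0, 0.0])))  # 0.0: orthogonal
  ```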
- Tokenization: Tokenization is the process of splitting raw text into model-readable tokens such as words, subwords, bytes, or characters. It determines vocabulary size, sequence length, and how efficiently a language model handles rare words, multilingual text, and code.
- Token: A token is the discrete unit a language model reads and predicts. Depending on the tokenizer, a token may be a word, subword, byte, punctuation mark, or special control symbol, and token count determines both context usage and API cost.
- Subword: A subword is a token unit smaller than a full word but larger than a character, learned to balance vocabulary size against sequence length. Subword tokenization lets models handle rare and novel words by composing them from reusable pieces.
- Vocabulary: A vocabulary is the fixed set of tokens a tokenizer can map text into and a model can natively process. Its size trades off compression against flexibility: larger vocabularies shorten sequences, while smaller ones rely more on subword or byte composition.
- Autoregressive language model: An autoregressive language model generates text left-to-right by modeling \( P(w_t \mid w_{<t}) \) for each token. Because it only conditions on past tokens, it can be used directly for open-ended generation as well as scoring sequences.
- Masked language model: A masked language model is trained to recover tokens hidden within a sequence using both left and right context. This bidirectional training makes MLMs strong encoders for understanding tasks, but less natural than autoregressive models for direct generation.
- Causal language model: A causal language model predicts each token using only earlier tokens, enforced by a causal attention mask. It is essentially the same modeling family as an autoregressive language model, with the word 'causal' emphasizing the masking constraint in self-attention.
- Chat language model: A chat language model is a pretrained LLM further tuned to follow instructions and handle multi-turn dialogue. It is usually built by supervised fine-tuning plus preference optimization or RLHF, so it behaves more helpfully and safely than the raw base model.
- Negative log-likelihood: Negative log-likelihood is the loss obtained by negating the log-likelihood, so maximizing probability becomes minimizing a positive objective. It is the standard training loss for probabilistic classifiers, language models, and many generative models.
- Conditional probability: Conditional probability is the probability of an event after restricting attention to cases where another event is known to occur, written \( P(A \mid B) = P(A,B)/P(B) \). It is the basic object behind Bayes' rule, autoregressive models, and all context-dependent prediction.
- Discrete probability distribution: A discrete probability distribution assigns nonnegative probabilities to a countable set of outcomes that sum to 1. Softmax outputs in classification and next-token prediction are discrete distributions over labels or vocabulary items.
- Cross-entropy: Cross-entropy measures the average coding cost of samples from a true distribution \( p \) when encoded using a model distribution \( q \). In ML it is the standard loss for classification and language modeling, and minimizing it is equivalent to maximum likelihood up to an entropy constant.
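  A minimal sketch, treating both distributions as plain arrays:

  ```python
  import numpy as np

  def cross_entropy(p, q, eps=1e-12):
      # H(p, q) = -sum_i p_i * log(q_i), in nats
      p, q = np.asarray(p), np.asarray(q)
      return float(-(p * np.log(q + eps)).sum())

  # With a one-hot target, cross-entropy reduces to the negative
  # log-likelihood of the correct class under the model distribution.
  target = [0.0, 1.0, 0.0]
  model = [0.2, 0.7, 0.1]
  print(cross_entropy(target, model))  # -log(0.7) ≈ 0.357
  ```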
- Multiclass classification: Multiclass classification assigns each input to exactly one of \( K>2 \) mutually exclusive classes. Models usually produce a softmax distribution over classes and train with cross-entropy against a one-hot or label-smoothed target.
- Transformer: A Transformer is a sequence model built from self-attention, position-wise MLPs, residual connections, and normalization, rather than recurrence or convolution. Its key advantage is that every token can directly attend to every other token in parallel, which made modern LLM scaling practical.
- Decoder Block: A decoder block is the basic unit of a decoder-only Transformer: causal self-attention plus a position-wise MLP, wrapped with residual connections and normalization. Stacking these blocks lets the model mix context across tokens while preserving autoregressive generation.
- Decoder-only Transformer: A decoder-only Transformer is a Transformer architecture composed only of masked self-attention blocks, so each token can attend only to earlier tokens. This makes it the standard architecture for autoregressive language models such as GPT, LLaMA, and Claude.
- Self-Attention: Self-attention lets each token compute a weighted combination of representations from other tokens in the same sequence, with weights determined by query-key similarity. It is the mechanism that gives Transformers flexible, content-dependent context mixing without recurrence.
- Attention Score: An attention score is the compatibility value computed between a query and a key before normalization, often by dot product or a learned variant. Higher scores mean the corresponding token or memory slot should receive more weight after the softmax.
- What is a scaled attention score? A scaled attention score is a query-key dot product divided by \( \sqrt{d_k} \) before softmax. The scaling keeps the variance of the logits from growing with key dimension, which helps prevent softmax saturation and keeps gradients well behaved.
- Masked Attention Score: A masked attention score is an attention logit after adding a mask that blocks forbidden positions, typically by adding a very large negative value before softmax. This forces the resulting attention weight to be effectively zero at those positions.
- Attention Weights: Attention weights are the normalized coefficients, usually produced by a softmax over attention scores, that determine how much each value vector contributes to the output. They form a distribution over positions or memory entries for each query.
- Attention Mask: An attention mask is a tensor that tells an attention layer which positions may interact and which must be blocked. It is used for causal generation, padding suppression, and task-specific visibility patterns, and it must be applied before softmax, not after.
- Causal Mask: A causal mask blocks attention to future positions by masking entries above the sequence diagonal. It enforces left-to-right autoregressive prediction, ensuring that token \( t \) can depend only on tokens \( \le t \).
- Multi-Head Attention: Multi-head attention runs several attention mechanisms in parallel on different learned projections of the same input, then concatenates their outputs. This lets the model capture multiple relational patterns at once instead of forcing all interactions through a single attention map.
- Attention Head: An attention head is one parallel query-key-value attention computation inside multi-head attention. Different heads can specialize to different patterns, such as local syntax, long-range dependencies, or induction-like copying behavior.
- Grouped-Query Attention (GQA): Grouped-query attention shares key and value heads across groups of query heads, reducing KV-cache size and bandwidth during inference. It sits between full multi-head attention and multi-query attention, preserving most quality while making long-context serving cheaper.
- Query, Key, Value (QKV): Query, key, and value are the three learned projections used by attention: the query asks what to look for, the key says what each position offers, and the value is the content returned if that position is attended to. Attention weights come from query-key similarity, but outputs are weighted sums of values.
- Projection Matrix: A projection matrix is a learned linear map that transforms vectors into another representation space. In Transformers, separate projection matrices create Q, K, and V from hidden states, and another projection maps concatenated head outputs back to the model dimension.
- Position-wise MLP: A position-wise MLP is the feed-forward sublayer in a Transformer block, applied independently to each token after attention. It adds nonlinearity and channel mixing per token, complementing attention, which mixes information across positions.
- Residual Connection (Skip Connection): A residual connection adds a layer's input back to its output, so the layer learns a correction rather than an entirely new representation. This stabilizes optimization, improves gradient flow, and is one reason very deep networks and Transformers train reliably.
- Root Mean Square Normalization (RMSNorm): RMSNorm normalizes activations by their root mean square without subtracting the mean. Compared with LayerNorm it is slightly cheaper and often just as effective, which is why many modern LLMs use RMSNorm in place of full mean-and-variance normalization.
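  A minimal sketch (real implementations make `gain` a learned per-feature parameter):

  ```python
  import numpy as np

  def rmsnorm(x, gain, eps=1e-6):
      # Divide by the root mean square of the features; no mean subtraction.
      rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
      return gain * x / rms

  x = np.array([1.0, -2.0, 3.0, 0.5])
  print(rmsnorm(x, gain=np.ones(4)))
  ```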
- Context Window: The context window is the maximum number of tokens a model can process in one forward pass. It defines the model's accessible working memory at inference time, and longer windows increase both usefulness on long documents and computational cost.
- Autoregression: Autoregression is the factorization of a sequence distribution into a product of conditional next-step distributions. In language generation it means producing one token at a time, each conditioned on all previously generated tokens.
- Prompt: A prompt is the text or structured input given to a language model to condition its behavior and output. It can provide instructions, examples, retrieved context, or tool schemas, and in practice it acts as the model's temporary task specification.
- System Prompt: A system prompt is a high-priority instruction block that defines the assistant's role, rules, and behavioral constraints for a conversation. It is usually prepended invisibly to user messages and is intended to override lower-priority prompt content.
- Prompting Format: Prompting format is the template used to serialize instructions, roles, examples, and conversation turns into the token sequence a model expects. It matters because the same words in a different format can change model behavior, especially for chat-tuned systems.
- Few-Shot Prompting: Few-shot prompting includes a small number of labeled examples in the prompt so the model can infer the task from context without updating parameters. It is one of the clearest demonstrations of in-context learning in large language models.
- In-Context Learning: In-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Prompt Engineering: Prompt engineering is the practice of designing prompts that make a model reliably produce the desired behavior. It includes choosing instructions, examples, structure, and reasoning scaffolds, and it trades parameter updates for careful interface design.
- Chain of Thought: Chain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into one token prediction.
- Tree of Thought: Tree of Thought extends chain-of-thought by exploring multiple candidate reasoning paths, evaluating intermediate states, and searching over them with strategies such as BFS or DFS. It is useful when solving the task requires branching, backtracking, or comparing alternative partial plans.
- Self-Consistency: Self-consistency samples multiple reasoning traces for the same problem and chooses the most common final answer rather than trusting a single chain of thought. It often boosts accuracy because different samples make different mistakes, while the correct answer tends to recur.
- ReAct (Reason + Act): ReAct is a prompting pattern where a model alternates between reasoning in text and taking actions such as search or tool calls. This lets it use external information and observations to update its plan instead of reasoning only from the original prompt.
- Function Calling: Function calling is a language-model capability for producing structured tool invocations instead of only plain text. The model selects a function and arguments that match a schema, which makes tool use more reliable and easier to integrate with software systems.
- Large Language Model (LLM): A large language model is a very large neural language model, usually with billions of parameters, pretrained on massive text corpora. Scale gives LLMs broad world knowledge and emergent capabilities such as in-context learning, but the core training objective is still language modeling.
- Pretraining: Pretraining is the large-scale first stage of training where a model learns general-purpose representations from unlabeled or self-supervised data. For LLMs this usually means next-token prediction over massive corpora, producing a base model that later fine-tuning can adapt.
- Supervised Fine-Tuning (SFT): Supervised fine-tuning trains a pretrained model on curated input-output pairs so it follows instructions, styles, or task formats more reliably. In chat systems, SFT is the stage that turns a raw completion model into an assistant before preference alignment is applied.
- Finetuning: Finetuning continues training a pretrained model on a smaller task-specific or domain-specific dataset. It adapts existing representations rather than learning from scratch, which is why it usually needs far less data and compute than pretraining.
- Full Fine-Tune: A full fine-tune updates all of a model's parameters on the new task or domain. It offers maximum flexibility, but it is much more memory- and compute-intensive than PEFT methods and produces a separate full checkpoint for each adapted model.
- Parameter-Efficient Fine-Tuning (PEFT): PEFT is a family of fine-tuning methods that keep most pretrained weights frozen and train only a small number of added or selected parameters. It preserves much of full fine-tuning's quality while reducing memory, compute, and storage costs.
- Low-Rank Adaptation (LoRA): LoRA fine-tunes a model by expressing each weight update as a low-rank product \( \Delta W = BA \) while keeping the original weight matrix frozen. This dramatically cuts trainable parameters and optimizer state, which is why LoRA became the default PEFT method for LLMs.
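  A forward-pass sketch with toy sizes; `alpha` is the usual scaling hyperparameter, and `B` starts at zero so the adapted model initially matches the base model:

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  d, r = 64, 8                             # model dimension and LoRA rank, r << d

  W = rng.standard_normal((d, d))          # frozen pretrained weight
  A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
  B = np.zeros((d, r))                     # trainable up-projection, zero-init
  alpha = 16.0

  def lora_forward(x):
      # y = x W^T + (alpha / r) * x A^T B^T, i.e. Delta W = B A at low rank
      return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

  x = rng.standard_normal(d)
  print(lora_forward(x).shape)             # (64,); equals the base output at init
  ```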
- LoRA Adapter: A LoRA adapter is the task-specific pair of low-rank matrices inserted around a frozen base weight matrix to produce a learned update at inference or training time. Because adapters are small, many tasks can be stored, swapped, and merged without copying the full base model.
- QLoRA (Quantized LoRA): QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision. This makes fine-tuning very large models feasible on modest hardware because the base weights stay compressed while only the small adapter parameters receive gradient updates.
- Base Model: A base model is the pretrained model before instruction tuning, chat alignment, or task-specific fine-tuning. It is usually optimized only for language modeling, so it can complete text well but may not reliably follow user instructions or safety constraints.
- Sampling (in Language Models): Sampling in language models means selecting the next token from the predicted probability distribution instead of always taking the argmax. The decoding rule strongly shapes diversity, coherence, and repetition, which is why temperature, top-k, and top-p matter so much.
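  A nucleus (top-p) sampling sketch combining temperature with a cumulative-probability cutoff:

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def sample_top_p(logits, p=0.9, T=1.0):
      z = np.asarray(logits, dtype=float) / T
      probs = np.exp(z - z.max())
      probs /= probs.sum()
      order = np.argsort(probs)[::-1]                   # tokens by descending probability
      cum = np.cumsum(probs[order])
      keep = order[: int(np.searchsorted(cum, p)) + 1]  # smallest set reaching mass p
      nucleus = probs[keep] / probs[keep].sum()         # renormalize, then sample
      return int(rng.choice(keep, p=nucleus))

  print(sample_top_p(np.array([2.0, 1.0, 0.2, -1.0]), p=0.9))
  ```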
- Greedy Decoding: Greedy decoding always selects the highest-probability next token at each step. It is simple and deterministic, but it often gets trapped in bland or repetitive continuations because it never explores slightly less probable alternatives that might lead to better sequences.
- Hallucination: Hallucination is when a model produces content that is unsupported or false while presenting it as if it were correct. In language models it often comes from next-token training, weak grounding, or overconfident decoding rather than deliberate deception.
- Misalignment: Misalignment is the failure mode where optimizing a model for its training objective or proxy reward does not produce the behavior humans actually want. It includes problems like reward hacking, unsafe shortcuts, and goal pursuit that diverges from the intended specification.
- Bias (Fairness): In fairness contexts, bias means systematic differences in treatment or error rates across groups caused by data, labels, measurement, or deployment choices. Fairness asks which notion of equal treatment matters, and different fairness criteria often cannot all be satisfied at once.
- Explainability: Explainability is the ability to give a human-understandable reason for a model's prediction or behavior using features, examples, rules, or mechanisms. A good explanation should be useful to a person and, ideally, faithful to what the model actually used.
- Retrieval-Augmented Generation (RAG): Retrieval-augmented generation adds a retrieval step so the model conditions on external documents at inference time instead of relying only on memorized parameters. It can improve freshness and grounding, but answer quality depends heavily on retrieval recall, ranking, chunking, and how well the model uses the retrieved evidence.
- Semantic Search: Semantic search retrieves results by meaning rather than exact keyword overlap, usually by embedding queries and documents into a vector space and comparing similarity. It handles paraphrases well, but it is often combined with lexical search when exact terms or identifiers matter.
- Sparse Mixture-of-Experts (MoE) Layer: A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset such as top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
- Router Network: A router network scores experts or computation paths for each token and decides where that token should be sent in a conditional-compute model such as an MoE. A good router improves specialization while avoiding collapsed routing, overload, and excessive communication.
- Expert Network: An expert network is one of the specialized submodules inside an MoE layer that processes only the tokens routed to it. Experts usually share the same architecture but learn different functions, so specialization emerges from routing plus load-balancing constraints.
- Top-k Routing: Top-k routing sends each token only to the k highest-scoring experts instead of to every expert. This makes MoE computation sparse and efficient, but the choice of k trades off compute cost, robustness, and routing stability.
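  A per-token routing sketch (the router is just a linear scorer here):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def top_k_route(x, W_router, k=2):
      scores = x @ W_router                 # one logit per expert
      experts = np.argsort(scores)[-k:]     # indices of the k best experts
      gates = np.exp(scores[experts] - scores[experts].max())
      return experts, gates / gates.sum()   # softmax gates over the chosen experts

  d, n_experts = 16, 8
  x = rng.standard_normal(d)                # one token's hidden state
  W_router = rng.standard_normal((d, n_experts))
  print(top_k_route(x, W_router, k=2))
  ```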
- Load Balancing (MoE): Load balancing in MoE training adds losses or routing constraints so tokens are spread across experts instead of collapsing onto a few popular ones. It matters because uneven routing wastes capacity, creates bottlenecks, and leaves underused experts poorly trained.
- Switch Transformer: Switch Transformer is a simplified MoE Transformer that routes each token to exactly one expert in each sparse feed-forward layer. Top-1 routing reduces communication and implementation complexity, enabling very large sparse models, but makes router stability and load balancing especially important.
- Structured Pruning: Structured pruning removes whole channels, heads, layers, or blocks, producing regular sparsity that hardware can exploit directly. It usually yields better real-world speedups than unstructured pruning, though it gives less fine-grained control.
- Unstructured Pruning: Unstructured pruning zeros individual weights wherever they appear unimportant, creating irregular sparsity patterns. It can achieve high compression ratios, but specialized kernels are often needed to turn that sparsity into real latency gains.
- Quantization: Quantization stores or computes with lower-precision numbers, such as INT8 or 4-bit values, instead of full-precision floats. It reduces memory bandwidth and can speed inference, but accuracy depends on how well the lower-precision representation preserves weights and activations.
- Post-Training Quantization (PTQ): Post-training quantization converts a trained model to lower precision after optimization is finished, usually using a calibration set to estimate activation ranges. It is easy and cheap to apply, but accuracy can drop more than with quantization-aware training at very low bit widths.
- Quantization-Aware Training (QAT): Quantization-aware training simulates low-precision arithmetic during training so the model learns weights that remain accurate after quantization. It usually outperforms post-training quantization at low bit widths, but it adds training cost and implementation complexity.
- Preference-Based Alignment: Preference-based alignment trains models from judgments such as ‘response A is better than response B’ instead of only from supervised targets. It is useful when desired behavior is easier for humans to compare than to specify as a single correct answer.
- Reinforcement Learning from Human Feedback (RLHF): RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize reward while staying close to a reference model. It improved helpfulness and instruction following, but it can also create reward hacking and training instability.
- Reward Model: A reward model predicts a scalar preference score for a candidate response, usually from pairwise human comparisons. In RLHF it acts as a learned proxy objective, so the policy can exploit its mistakes if optimization pushes too hard against it.
- Self-Critique: Self-critique is a prompting or training pattern where a model reviews its own draft, identifies problems, and then revises the answer. It can improve reasoning and safety, but only when the model can recognize errors more reliably than it makes them.
- Direct Preference Optimization (DPO): DPO learns directly from preference pairs by making chosen responses more likely than rejected ones without running a separate RL loop. It can be derived from a KL-constrained reward-maximization view, which is why it is often presented as a simpler alternative to PPO-based RLHF.
- Vision Language Model (VLM): A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- Cross-Attention: Cross-attention lets one sequence or modality attend to representations produced by another sequence or modality. In encoder-decoder models the decoder queries encoder states, and in multimodal models text tokens often query visual features the same way.
- Vision Encoder: A vision encoder maps an image into features or tokens that downstream modules can use for classification, retrieval, or generation. CNNs and Vision Transformers are common vision encoders, differing mainly in how they represent spatial structure.
- CLIP (Contrastive Language-Image Pre-training): CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels.
- FlashAttention: FlashAttention is an exact attention algorithm that uses tiling and kernel fusion to minimize reads and writes between GPU HBM and on-chip SRAM. It preserves standard attention outputs while greatly reducing memory traffic, which yields large speed and memory gains on long sequences.
- Tensor Parallelism: Tensor parallelism shards individual large matrix operations across devices, such as splitting weight matrices by rows or columns. It is effective for very large Transformers, but the frequent collectives mean fast interconnects are important.
- Context Parallelism: Context parallelism distributes a long sequence across devices so context tokens and their attention-related work are sharded instead of fully replicated. It helps long-context training or inference scale beyond one device, but requires extra communication to preserve exact attention across chunks.
- Model Sharding: Model sharding splits a model's parameters across devices or storage tiers instead of keeping a full copy everywhere. It is a general systems technique used in tensor parallelism, FSDP, offloading, and large-model serving to reduce per-device memory requirements.
- Floating-Point Operations (FLOPs): FLOPs count the number of floating-point arithmetic operations required by a model or workload. They are a useful compute proxy for comparing training or inference cost, though real speed also depends on memory traffic, parallelism, and hardware utilization.
- Mixed Precision Training: Mixed-precision training performs most computation in lower precision such as FP16 or bfloat16 while keeping selected quantities in higher precision for stability. It reduces memory use and often increases throughput without much accuracy loss when implemented carefully.
- Long-Context Pretraining: Long-context pretraining trains or continues training a model on examples with much longer sequences so it learns to use distant context instead of only fitting short windows. It is usually needed because simply changing positional scaling or the context limit does not teach robust long-range retrieval or reasoning.
- Needle in a Haystack: Needle in a Haystack is a long-context benchmark that tests whether a model can retrieve a small target fact embedded inside a large distractor context. It is useful for measuring position-sensitive retrieval, but strong needle scores do not guarantee broader long-document reasoning.
- Positional Encoding: Positional encoding injects token order information into architectures like Transformers whose attention is otherwise permutation-invariant. It can be absolute or relative, and the choice strongly affects extrapolation, long-context behavior, and inductive bias.
- Absolute Position Encoding: Absolute position encoding assigns each sequence position its own encoding or embedding and combines it with token representations. It works well inside the trained context range, but it often extrapolates poorly because positions are treated as fixed IDs rather than relative distances.
- Relative Position Encoding: Relative position encoding represents how far apart tokens are rather than assigning each position a standalone ID. That lets attention depend on distance or offset, which often improves length generalization and transfers patterns more naturally across positions.
- Token ID: A token ID is the integer index assigned to a token after tokenization. Models do not operate on raw text directly; they look up embeddings from token IDs and later map output logits back to IDs during decoding.
- Vocabulary Size: Vocabulary size is the number of distinct tokens a tokenizer can emit. A larger vocabulary shortens sequences but increases embedding and softmax size, while a smaller vocabulary produces longer sequences and more token fragmentation.
- Subword Tokenization: Subword tokenization splits text into frequent pieces smaller than words but larger than individual characters. It handles rare words and open vocabularies well by composing unfamiliar words from known subword units.
- Special Tokens: Special tokens are reserved tokens with structural or control meaning, such as BOS, EOS, PAD, SEP, or mask tokens, rather than ordinary text content. They shape formatting, training objectives, and sometimes model behavior.
- Padding Token: A padding token is a dummy token added so sequences in a batch have equal length. It should be ignored by the loss and usually masked from attention so it does not behave like real context.
- BOS Token (Beginning of Sequence): A BOS token marks the beginning of a sequence and gives the model a consistent start symbol for conditioning generation or encoding. It can help define sequence boundaries and sometimes carries special training semantics.
- EOS Token (End of Sequence): An EOS token marks the end of a sequence and tells the model where generation should stop. During training it teaches sequence termination, and during inference it is one of the main stopping conditions.
- Attention Mechanism: Attention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector.
- Encoder-Decoder Architecture: An encoder-decoder architecture uses an encoder to turn an input sequence into representations and a decoder to generate an output sequence conditioned on those representations. It is the standard design for translation, summarization, and other input-to-output generation tasks.
- Sequence-to-Sequence (Seq2Seq): Sequence-to-sequence learning maps one sequence to another, often with different lengths, such as translation or summarization. Modern seq2seq models are usually encoder-decoder Transformers, though earlier versions used recurrent networks with attention.
- Layer Normalization: Layer normalization normalizes activations across features within each example rather than across the batch. It works well for variable-length sequences and small batch sizes, which is why it is standard in Transformers.
- Gradient Clipping: Gradient clipping limits gradient norms or values before the optimizer step to prevent unstable updates and exploding gradients. It does not fix a bad objective, but it can stabilize training when rare large gradients would otherwise dominate.
- Adam Optimizer: Adam is an adaptive first-order optimizer that keeps moving averages of the gradient and its square, then bias-corrects them to scale each parameter's update. It converges quickly and is standard for Transformer training, though it is sensitive to weight decay design and hyperparameters.
- Learning Rate Schedule: A learning-rate schedule changes the learning rate over training instead of keeping it constant. Schedules matter because they balance fast early progress with stable late optimization and often determine final performance as much as the base optimizer.
- Warmup: Warmup starts training with a small learning rate and gradually increases it during the first steps. It reduces early instability, especially in Transformers where large updates before optimizer statistics settle can cause divergence.
- Cosine Annealing: Cosine annealing decays the learning rate following a cosine curve from a high value to a low value, sometimes with restarts. It provides a smooth schedule that often works well in practice without needing many hand-tuned decay boundaries.
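  A sketch of the common pairing of linear warmup (see the Warmup entry above) with cosine decay:

  ```python
  import math

  def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
      if step < warmup_steps:
          return max_lr * step / warmup_steps          # linear warmup
      progress = (step - warmup_steps) / (total_steps - warmup_steps)
      return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

  for s in (0, 100, 1000, 10000):
      print(s, lr_at_step(s, max_lr=3e-4, warmup_steps=200, total_steps=10000))
  ```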
- Weight Initialization: Weight initialization chooses starting parameter values before training begins. Good initialization keeps activations and gradients in useful ranges so learning can start without vanishing, exploding, or breaking symmetry.
- Label Smoothing: Label smoothing replaces hard one-hot targets with a mostly-correct probability distribution that assigns a small amount of mass to other classes. It regularizes overconfident classifiers and often improves generalization and calibration, though it can hurt when exact probabilities matter.
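  A sketch of one common variant, which puts 1 - epsilon on the true class and spreads epsilon over the remaining classes:

  ```python
  import numpy as np

  def smooth_labels(target_index, num_classes, epsilon=0.1):
      labels = np.full(num_classes, epsilon / (num_classes - 1))
      labels[target_index] = 1.0 - epsilon
      return labels

  print(smooth_labels(target_index=2, num_classes=5))
  # [0.025 0.025 0.9   0.025 0.025]
  ```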
- GPU Acceleration: GPU acceleration uses highly parallel graphics processors to speed the matrix and tensor operations that dominate modern ML workloads. It matters because deep learning is mostly throughput-bound linear algebra, which GPUs execute far more efficiently than general-purpose CPUs.
- Scaling Laws: Scaling laws are empirical relationships showing how loss or capability changes with model size, data, and compute, often following approximate power laws. They matter because they let researchers forecast returns to scale and choose more compute-efficient training regimes.
- What are emergent capabilities in large language models? Emergent capabilities in large language models are abilities that look weak or absent at small scale but become strong once the model is large enough. The key caveat is that “emergence” can depend on the metric and threshold used, so apparent jumps are not always literal discontinuities in the underlying capability.
- Zero-Shot Learning: Zero-shot learning is the ability to perform a task from a description alone, without task-specific training examples in the prompt or fine-tuning data. In LLMs it is a direct consequence of broad pretraining and instruction-following ability.
- Softmax Head: A softmax head is the output projection plus softmax normalization that converts hidden representations into a probability distribution over classes or vocabulary items. In language models it is the layer that turns the final hidden state into next-token probabilities.
- Beam Search: Beam search is a decoding algorithm that keeps the top-scoring partial sequences at each step instead of only the single best one. It approximates high-probability generation better than greedy decoding, but it can still miss the global optimum and often reduces diversity.
- Repetition Penalty: A repetition penalty is a decoding heuristic that downweights tokens or phrases the model has already used, reducing loops and bland repetition. It improves generation quality when a model is prone to degeneracy, but too much penalty can make text unnatural or incoherent.
- Encoder (Transformer): A Transformer encoder is a stack of self-attention and feed-forward blocks that builds contextual representations of an input sequence. Because encoder self-attention is usually bidirectional, it is well suited for understanding tasks such as classification, retrieval, and sequence labeling.
- Decoder (Transformer): A Transformer decoder is the autoregressive half of the architecture that predicts tokens using causal self-attention and, in encoder-decoder models, optional cross-attention to an encoder output. Its defining constraint is that each position can attend only to earlier positions when generating.
- Steering Vectors: Steering vectors are directions in activation space that, when added to hidden states, systematically change model behavior toward traits such as refusal, sentiment, or persona. They are useful because they show that some behaviors can be modified directly in representation space without full retraining.
- Activation Patching: Activation patching is a causal analysis method where activations from one run are inserted into another to test which components matter for a given behavior. If patching a layer or head restores the behavior, that is evidence that the component lies on the relevant causal path.
- Linear Attention: Linear attention is the family of attention mechanisms that rewrite or approximate softmax attention so sequence processing scales roughly linearly instead of quadratically with length. The benefit is efficiency on long contexts, but the tradeoff is that exact softmax behavior is usually lost.
- ALiBi (Attention with Linear Biases): ALiBi is a positional method that adds head-specific linear distance penalties directly to attention logits instead of injecting separate position embeddings. Because the bias is built into the score function, models trained with ALiBi often extrapolate to longer contexts better than models tied to a fixed embedding table.
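  A sketch of the bias matrix using the paper's geometric slope schedule for power-of-two head counts:

  ```python
  import numpy as np

  def alibi_bias(T, num_heads):
      # Slopes 2^(-8/n), 2^(-16/n), ..., one per head.
      slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
      # distance[i, j] = j - i: zero on the diagonal, negative for past keys.
      distance = np.arange(T)[None, :] - np.arange(T)[:, None]
      distance = np.minimum(distance, 0)   # future positions are masked anyway
      return slopes[:, None, None] * distance[None, :, :]  # (heads, T, T) bias

  print(alibi_bias(T=4, num_heads=2)[0])   # added directly to the attention logits
  ```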
- YaRN / NTK-aware RoPE Scaling: YaRN and other NTK-aware RoPE-scaling methods extend the usable context of RoPE-based models by rescaling or interpolating rotary frequencies rather than retraining the model from scratch. Their goal is to preserve short-context behavior while making long-range positions less distorted.
- Sliding-Window Attention: Sliding-window attention restricts each token to attending only within a local context window rather than the entire sequence. This reduces compute and memory from full-context attention and is effective when most useful dependencies are nearby, though it can miss long-range interactions unless combined with global mechanisms.
- Multi-head Latent Attention (MLA): Multi-head Latent Attention compresses the keys and values of multi-head attention into a smaller latent representation before use. Its main advantage is a much smaller KV cache and lower decode-time memory bandwidth, which is why it is attractive for long-context serving.
- Auxiliary Load-Balancing Loss (MoE): The auxiliary load-balancing loss in a Mixture-of-Experts model encourages the router to spread tokens more evenly across experts. Without it, routing often collapses onto a few experts, which wastes capacity and creates severe hot spots in both learning and systems performance.
- vLLM & Continuous Batching: vLLM is an LLM serving system built around PagedAttention and continuous batching. Instead of waiting for a batch to finish, it admits and schedules requests at each decoding step, which reduces padding waste and improves throughput for variable-length generations.
- Attention Sinks / StreamingLLM: Attention sinks are the first few tokens in a causal Transformer that absorb disproportionate attention from later positions, even when they carry little semantic content. StreamingLLM exploits this by keeping sink tokens and a short recent window in the KV cache, enabling long streaming inference with bounded memory.
- Circuit Analysis: The mechanistic-interpretability practice of identifying subgraphs of weights, residual-stream components, and attention heads that jointly implement a human-interpretable algorithm (indirect object identification, modular addition, greater-than). Circuit analysis produces falsifiable, causal accounts of what a network has learned.
- ROME / MEMIT Model Editing: Rank-one edits to MLP weights that inject a single fact (ROME) or thousands of facts (MEMIT) into a pretrained LLM without retraining. They exploit the observation that MLP blocks act as key–value memories, identify the causal neurons via activation patching, and solve a closed-form optimisation problem for the minimal-norm weight update.
- Vision Transformer (ViT): Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern vision-language models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work.
- Diffusion Transformers (DiT): Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer over VAE-latent patches. DiT scales predictably with compute, matches or exceeds U-Net quality, and is the architectural backbone of Stable Diffusion 3, Sora, and most frontier text-to-image/video diffusion models.
- Whisper (Speech-to-Text): OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, and voice-activity / language ID from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- Transformer-XL / Segment-Level Recurrence: Dai et al. (2019) extend Transformers beyond fixed context by caching hidden states of the previous segment and allowing attention to read from them, a simple "segment-level recurrence" that gives an effective receptive field of \( N \cdot L \) for \( L \) layers and segment length \( N \). Paired with relative positional encoding, it was a key bridge between pure attention and long-context models.
- Longformer / BigBird (Sparse Long-Context Attention): Fixed sparsity patterns that reduce attention from \( O(n^2) \) to \( O(n) \) for long documents. Longformer combines sliding-window + global attention; BigBird adds random attention and proves the result retains full-attention universal-approximation properties. Both were pre-2022 answers to scaling Transformers to 4k–16k tokens.
- RetNet / Retention Networks: Sun et al. (2023) introduce a Transformer-alternative block whose retention operator admits three equivalent forms: parallel (for training), recurrent (for \( O(1) \) inference per token), and chunkwise-recurrent (for long-sequence training). RetNet aims for RNN-like inference cost with Transformer-like parallelisable training.
- RWKV: An RNN-Transformer hybrid (Peng et al., 2023) whose block is a parallelisable linear-attention operation at training time and a simple recurrent state update at inference time. RWKV scales to 14B+ parameters with Transformer-competitive perplexity, offering constant-memory inference.
- Memory-Augmented Transformers: Transformer variants that extend effective context length with an external memory: Recurrent Memory Transformers (RMT) pass summary tokens across chunks, Memorizing Transformers retrieve past kNN keys, and Infini-attention compresses the tail of context into a linear-attention state. A bridge between fixed-context Transformers and sequence models with unbounded memory.
- In-Context Learning Mechanisms: In-context learning (ICL) is the empirical phenomenon that a frozen LLM solves new tasks from few-shot examples in the prompt. Mechanistic studies show ICL is implemented by a small set of attention circuits (induction heads, function vectors, and implicit gradient-descent-like updates) that emerge during pretraining once the data and depth budget cross a threshold.
- Memory in LLMs: An LLM's working memory is the KV-cache for the current context window; longer-term memory is implemented externally by retrieval over vector / hybrid stores, by writing to scratchpads or tool state, or by parameter updates (continual fine-tuning). Each option has a different latency, capacity, and forgetting profile; production systems combine all three.
- Vision Transformer (ViT) Variants: Since the original ViT, a wide family of variants has emerged that improve data efficiency, locality, hierarchy, and pretraining objective. The most influential are DeiT (training recipe), Swin (windowed hierarchical attention), MAE (masked-image pretraining), DINOv2 (self-distilled features), and SigLIP (sigmoid contrastive pretraining). Each addresses a specific weakness of the vanilla ViT.
- Graph Transformers: Graph Transformers apply self-attention over graph-structured data, with positional encodings (Laplacian eigenvectors, random walks, shortest-path distances) injecting graph topology that vanilla attention lacks. They generalise message-passing GNNs and have become the leading architecture for molecular property prediction, code understanding, and combinatorial optimisation.
- Perceiver Architecture: The Perceiver uses cross-attention from a small latent array to a potentially very large input, then performs most computation in latent space. This decouples cost from input length and makes one architecture usable across images, audio, video, and other modalities.
- Modular Neural Networks: A modular network composes specialised sub-networks (modules) under a routing or composition rule that decides which modules process each input. Mixture-of-Experts is the most successful instance, but the family includes routed adapter networks, modular meta-learners, and compositional architectures designed for systematic generalisation. The motivation is parameter efficiency and reusable skills.
- Compressed Sparse Attention (CSA): Compressed Sparse Attention (CSA) is a long-context attention scheme that first compresses the KV cache into block summaries and then performs sparse attention only over the top-k relevant compressed blocks. An added sliding-window branch preserves exact local dependencies, so CSA cuts both KV memory and long-context attention compute without collapsing into a purely local window.
- Decision Transformers: Decision Transformers cast offline reinforcement learning as conditional sequence modeling: a Transformer predicts the next action from past returns-to-go, states, and actions. This avoids explicit Bellman backups and instead treats policy learning like autoregressive imitation conditioned on the desired future return.
- Serving LLMs at Scale: Serving LLMs at scale is a systems problem of jointly optimizing prompt prefill throughput, token-by-token decode latency, KV-cache memory, batching policy, and fleet utilization. Modern serving stacks rely on continuous batching, prefix caching, PagedAttention, speculative decoding, and sometimes prefill/decode disaggregation to keep both tail latency and GPU cost under control.
- RLHF as KL-Regularized Policy Optimization: A deeper theoretical view of RLHF treats post-training as optimizing a policy against a learned reward while regularizing toward a reference model with a KL penalty. This viewpoint explains why PPO-RLHF, reward-model training, and even DPO-style objectives are closely related: they are different ways of solving or approximating the same regularized preference-optimization problem.
- Attention Is All You Need: “Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.