Tag: nlp
173 topic(s)
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- GPT-1 (Generative Pre-Training): GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- ELMo (Embeddings from Language Models): ELMo produces contextualized word embeddings by taking a learned task-specific combination of hidden states from a pretrained bidirectional LSTM language model. Unlike static embeddings such as word2vec or GloVe, it gives the same word different vectors in different sentence contexts.
- Key-Value Memory Networks: Key-Value Memory Networks store each memory slot as a key for retrieval and a separate value for the returned content. This decouples matching from payload and is a direct conceptual precursor to modern query-key-value attention.
- Luong Attention (Global and Local): Luong attention is a sequence-to-sequence attention mechanism that scores decoder states against encoder states using multiplicative forms such as dot or bilinear attention. It distinguishes global attention over all source positions from local attention over a predicted window, helping make neural machine translation more scalable.
- GloVe Word Embeddings: GloVe learns word embeddings by fitting vector dot products to the log of global word-word co-occurrence counts. Because its objective is motivated by ratios of co-occurrence probabilities, linear relations such as king minus man plus woman approximately equals queen often emerge in the embedding space.
- Neural Probabilistic Language Model: The Neural Probabilistic Language Model replaced count-based n-grams with learned word embeddings and a neural network that predicts the next word from a continuous representation of context. Its core contribution was showing that distributed representations let language models generalize to unseen but similar word sequences.
- KL-Divergence Penalty in RLHF: The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the objective. This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
- Next-Token Prediction Objective (Causal Language Modeling): Next-token prediction trains a causal language model to assign high probability to each token given all previous tokens. Maximizing this likelihood over large text corpora teaches the model syntax, facts, and reusable patterns that later support prompting and generation.
- Byte Pair Encoding (BPE): Byte Pair Encoding is a subword tokenization method that repeatedly merges the most frequent adjacent symbols in a corpus. It builds a vocabulary between characters and whole words, which handles rare words better than word-level tokenization while keeping sequence lengths manageable.
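The BPE merge loop can be sketched in a few lines of Python. This is a toy version (the function name and sample words are illustrative): it counts adjacent symbol pairs over a word-frequency table, merges the most frequent pair, and repeats.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Each word starts as a tuple of characters; vocab maps word -> frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low"], 2)
```

Real tokenizers operate on a pre-tokenized byte- or character-level corpus and store the merge list for later encoding; this sketch only shows the greedy merge-selection step.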
- Softmax Temperature: Softmax temperature rescales logits before softmax to control randomness in the output distribution. Lower temperature makes probabilities sharper and decoding more deterministic, while higher temperature flattens the distribution and increases diversity.
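Temperature scaling is a one-line change to softmax; a minimal sketch in plain Python (no framework assumed):

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature before exponentiating.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, temperature=0.5)  # more peaked
flat = softmax(logits, temperature=2.0)   # more uniform
```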
- Hidden Markov Model (HMM): A Hidden Markov Model is a sequence model with an unobserved Markov chain of states and an observed emission distribution from each state. It became a standard model for speech, tagging, and other structured sequence tasks because dynamic programming can efficiently infer likely states and sequence probabilities.
- Recurrent neural network (RNN): A recurrent neural network processes sequences by maintaining a hidden state that is updated one step at a time from the current input and previous state. This gives it a notion of temporal memory, but plain RNNs are hard to train on long dependencies because gradients can vanish or explode.
- Elman RNN: An Elman RNN is the classic simple recurrent network in which the next hidden state is a nonlinear function of the current input and previous hidden state. It introduced the basic hidden-state recurrence used by later gated models, but long-range memory is poor without gating.
- Hidden state: The hidden state is the internal representation a sequential model carries forward as it processes inputs over time. In an RNN or LSTM it summarizes relevant past context, and in broader neural architectures it usually means a layer's intermediate activation vector.
- Long short-term memory (LSTM): Long short-term memory is a gated recurrent architecture designed to preserve information over long timescales. Its input, forget, and output gates regulate a cell state with near-linear self-connections, which helps prevent the vanishing-gradient behavior of simple RNNs.
- xLSTM: xLSTM is a family of modern LSTM variants that adds exponential gating and redesigned memory structures, including scalar-memory and matrix-memory forms, to make recurrent models more scalable. The goal is to keep LSTM-style recurrence while improving stability, parallelism, and long-context performance.
- minLSTM: minLSTM is a simplified LSTM variant designed to remove some of the sequential dependencies that make classical LSTMs expensive while keeping useful gating behavior. The result is a lighter recurrent block that can be trained more efficiently and scaled more easily.
- Gated recurrent unit (GRU): A gated recurrent unit is a recurrent architecture that uses update and reset gates to control how much past information is kept and how much new input is written into the hidden state. It is simpler than an LSTM because it has no separate cell state, yet it often achieves similar sequence-modeling performance.
- Embedding layer: An embedding layer maps discrete IDs such as words, subwords, or items to learned dense vectors. It is essential whenever symbolic inputs must be represented in a continuous space that gradient-based models can manipulate.
- Embedding vector: An embedding vector is the dense continuous representation assigned to a discrete token, item, or entity by an embedding table or model. Its meaning comes from geometry: similar entities tend to occupy nearby directions or neighborhoods in the learned space.
- Word embedding: A word embedding is a dense vector representation of a word learned from distributional context rather than hand-coded features. Its purpose is to place semantically or syntactically related words near one another in vector space so downstream models can generalize across vocabulary items.
- Word2Vec: Word2Vec is a family of shallow neural methods that learn word embeddings from local context, most famously via the skip-gram and CBOW objectives. Its importance is that simple predictive training on large text corpora produced useful semantic geometry, including analogy-like linear regularities.
- Skip-gram: Skip-gram trains a model to predict surrounding context words from a center word. It learns embeddings that are especially good for capturing rare-word semantics because each observed word directly becomes a prediction source for many context targets.
- FastText: FastText extends Word2Vec by representing a word as a bag of character n-gram embeddings rather than as a single atomic vector. That lets it model morphology and produce reasonable embeddings for rare or even unseen words.
- Semantic similarity: Semantic similarity is the degree to which two words, sentences, or documents share meaning rather than just surface form. In machine learning it is often estimated with embeddings and cosine similarity, which turns meaning comparison into a geometric problem.
- Cosine similarity: Cosine similarity measures the angle between two vectors: \( \cos \theta = x \cdot y / (\|x\| \|y\|) \). It ignores magnitude and compares direction, which is why it is the default similarity metric for embeddings in retrieval, clustering, and semantic search.
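The cosine formula translates directly into code; a minimal sketch without any numerical library:

```python
import math

def cosine_similarity(x, y):
    # cos(theta) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

Orthogonal vectors score 0, parallel vectors score 1 regardless of length, which is the magnitude-invariance described above.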
- Bag of words: Bag of words represents a document by counts or weights of vocabulary terms while discarding word order and syntax. It is simple, sparse, and historically central to information retrieval and document classification, but it cannot distinguish sentences with the same words in different orders.
- Document-term matrix: A document-term matrix is a matrix whose rows are documents, columns are vocabulary terms, and entries are counts or weights such as TF-IDF. It is the core data structure behind bag-of-words retrieval, topic modeling, and many classical NLP pipelines.
- TF-IDF: TF-IDF weights a term by how frequent it is in a document and how rare it is across the corpus, typically \( tf(w,d) \log(N/df(w)) \). It downweights ubiquitous words and highlights terms that are especially informative for a given document.
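A toy implementation of the \( tf(w,d) \log(N/df(w)) \) weighting over already-tokenized documents (names are illustrative; real libraries add smoothing and normalization options):

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document.
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency counts each doc once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights
```

Note that a term appearing in every document gets weight exactly zero under this raw form, which is why practical implementations often smooth the idf term.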
- Dense vector: A dense vector is a low- or moderate-dimensional representation in which most entries are nonzero. Dense vectors are usually learned embeddings, so they capture semantic similarity better than sparse count vectors but are harder to interpret directly.
- Sparse vector: A sparse vector has very few nonzero entries relative to its dimensionality. Classical text features such as bag-of-words and TF-IDF are sparse, which makes them memory-efficient and interpretable even when the feature space is huge.
- One-hot encoding: One-hot encoding represents a categorical variable as a binary vector with exactly one 1 and all other entries 0. It preserves category identity without implying any ordering, but its dimensionality grows linearly with the number of categories.
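One-hot encoding is short enough to show in full; a minimal sketch (assumes the category list is fixed and ordered):

```python
def one_hot(category, categories):
    # Binary vector with a single 1 at the category's index.
    vec = [0] * len(categories)
    vec[categories.index(category)] = 1
    return vec
```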
- Tokenization: Tokenization is the process of splitting raw text into model-readable tokens such as words, subwords, bytes, or characters. It determines vocabulary size, sequence length, and how efficiently a language model handles rare words, multilingual text, and code.
- Token: A token is the discrete unit a language model reads and predicts. Depending on the tokenizer, a token may be a word, subword, byte, punctuation mark, or special control symbol, and token count determines both context usage and API cost.
- Subword: A subword is a token unit smaller than a full word but larger than a character, learned to balance vocabulary size against sequence length. Subword tokenization lets models handle rare and novel words by composing them from reusable pieces.
- Vocabulary: A vocabulary is the fixed set of tokens a tokenizer can map text into and a model can natively process. Its size trades off compression against flexibility: larger vocabularies shorten sequences, while smaller ones rely more on subword or byte composition.
- Corpus: A corpus is a structured collection of text used to train, fine-tune, or evaluate language models. Its size, quality, domain mix, and cleaning decisions strongly shape what a model knows, how it generalizes, and which biases it inherits.
- N-gram: An n-gram is a contiguous sequence of \( n \) tokens, such as a bigram for \( n=2 \) or trigram for \( n=3 \). N-grams are the basic units of classical language models and many text features because they capture short-range local context.
- Count-based language model: A count-based language model estimates sequence probabilities from n-gram counts in a corpus, then uses smoothing or backoff for unseen events. It was the dominant pre-neural approach to language modeling, but it struggles with long context and data sparsity.
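The count-based idea can be sketched for the bigram case: maximum-likelihood estimates are just bigram counts divided by unigram counts (no smoothing in this toy version):

```python
from collections import Counter

def bigram_probs(tokens):
    # MLE: P(w2 | w1) = count(w1, w2) / count(w1).
    unigrams = Counter(tokens[:-1])  # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = bigram_probs("the cat sat on the mat".split())
```

Any bigram absent from the training text gets probability zero here, which is exactly the sparsity problem that smoothing and backoff (below) address.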
- Language model: A language model assigns probabilities to token sequences, or equivalently predicts missing or next tokens from context. This unifies classical n-gram models, masked models like BERT, and autoregressive LLMs such as GPT under one probabilistic framework.
- Autoregressive language model: An autoregressive language model generates text left-to-right by modeling \( P(w_t \mid w_{<t}) \) for each token. Because it only conditions on past tokens, it can be used directly for open-ended generation as well as scoring sequences.
- Masked language model: A masked language model is trained to recover tokens hidden within a sequence using both left and right context. This bidirectional training makes MLMs strong encoders for understanding tasks, but less natural than autoregressive models for direct generation.
- Causal language model: A causal language model predicts each token using only earlier tokens, enforced by a causal attention mask. It is essentially the same modeling family as an autoregressive language model, with the word 'causal' emphasizing the masking constraint in self-attention.
- Chat language model: A chat language model is a pretrained LLM further tuned to follow instructions and handle multi-turn dialogue. It is usually built by supervised fine-tuning plus preference optimization or RLHF, so it behaves more helpfully and safely than the raw base model.
- Perplexity: Perplexity is the exponentiated average negative log-likelihood of a test sequence, so lower perplexity means the model is less surprised by the data. It is a standard intrinsic metric for language models, though low perplexity does not guarantee downstream usefulness.
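Given the per-token probabilities a model assigned to a test sequence, perplexity is a two-line computation:

```python
import math

def perplexity(token_probs):
    # Exponentiated average negative log-likelihood over the sequence.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns uniform probability 1/k to every token has perplexity exactly k, which is why perplexity is often read as an effective branching factor.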
- Laplace smoothing: Laplace smoothing adds a small constant, often 1, to every discrete count before normalizing probabilities. It prevents zero-probability events in models such as naive Bayes and n-gram LMs, though it can over-smooth when the vocabulary is large.
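The add-alpha estimate in code (alpha = 1 gives classic add-one smoothing):

```python
def laplace_prob(counts, word, vocab_size, alpha=1.0):
    # Add alpha to every count; the denominator grows by alpha per vocab entry,
    # so unseen words also receive nonzero probability mass.
    total = sum(counts.values()) + alpha * vocab_size
    return (counts.get(word, 0) + alpha) / total
```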
- Backoff (N-gram backoff): Backoff is an n-gram smoothing strategy that uses a high-order estimate when it has enough evidence and otherwise falls back to a lower-order n-gram. It handles sparsity by preferring specific context when available without assigning zero probability to unseen sequences.
- Zipf's law: Zipf's law says a word's frequency is roughly inversely proportional to its rank in the frequency table. This heavy-tailed structure explains why a few tokens dominate corpora, why vocabularies keep growing with more data, and why tokenization and smoothing are central in NLP.
- ROUGE: ROUGE is a family of overlap metrics for summarization and generation, based on matching n-grams, longest common subsequences, or skip-bigrams between a candidate and reference text. It measures lexical recall more than semantic faithfulness, so it is informative but limited.
- Edit Distance: Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform one sequence into another. The most common version, Levenshtein distance, is a dynamic-programming measure of string similarity used in spelling correction, alignment, and evaluation.
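The standard dynamic program for Levenshtein distance, keeping only one row of the table at a time:

```python
def levenshtein(a, b):
    # prev[j] = distance between a[:i-1] and b[:j]; cur is the row for a[:i].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]
```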
- Autoregression: Autoregression is the factorization of a sequence distribution into a product of conditional next-step distributions. In language generation it means producing one token at a time, each conditioned on all previously generated tokens.
- Prompt: A prompt is the text or structured input given to a language model to condition its behavior and output. It can provide instructions, examples, retrieved context, or tool schemas, and in practice it acts as the model's temporary task specification.
- System Prompt: A system prompt is a high-priority instruction block that defines the assistant's role, rules, and behavioral constraints for a conversation. It is usually prepended invisibly to user messages and is intended to override lower-priority prompt content.
- Prompting Format: Prompting format is the template used to serialize instructions, roles, examples, and conversation turns into the token sequence a model expects. It matters because the same words in a different format can change model behavior, especially for chat-tuned systems.
- Few-Shot Prompting: Few-shot prompting includes a small number of labeled examples in the prompt so the model can infer the task from context without updating parameters. It is one of the clearest demonstrations of in-context learning in large language models.
- In-Context Learning: In-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Prompt Engineering: Prompt engineering is the practice of designing prompts that make a model reliably produce the desired behavior. It includes choosing instructions, examples, structure, and reasoning scaffolds, and it trades parameter updates for careful interface design.
- Chain of Thought: Chain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into one token prediction.
- Tree of Thought: Tree of Thought extends chain-of-thought by exploring multiple candidate reasoning paths, evaluating intermediate states, and searching over them with strategies such as BFS or DFS. It is useful when solving the task requires branching, backtracking, or comparing alternative partial plans.
- Self-Consistency: Self-consistency samples multiple reasoning traces for the same problem and chooses the most common final answer rather than trusting a single chain of thought. It often boosts accuracy because different samples make different mistakes, while the correct answer tends to recur.
- ReAct (Reason + Act): ReAct is a prompting pattern where a model alternates between reasoning in text and taking actions such as search or tool calls. This lets it use external information and observations to update its plan instead of reasoning only from the original prompt.
- Function Calling: Function calling is a language-model capability for producing structured tool invocations instead of only plain text. The model selects a function and arguments that match a schema, which makes tool use more reliable and easier to integrate with software systems.
- Large Language Model (LLM): A large language model is a very large neural language model, usually with billions of parameters, pretrained on massive text corpora. Scale gives LLMs broad world knowledge and emergent capabilities such as in-context learning, but the core training objective is still language modeling.
- Pretraining: Pretraining is the large-scale first stage of training where a model learns general-purpose representations from unlabeled or self-supervised data. For LLMs this usually means next-token prediction over massive corpora, producing a base model that later fine-tuning can adapt.
- Supervised Fine-Tuning (SFT): Supervised fine-tuning trains a pretrained model on curated input-output pairs so it follows instructions, styles, or task formats more reliably. In chat systems, SFT is the stage that turns a raw completion model into an assistant before preference alignment is applied.
- Finetuning: Finetuning continues training a pretrained model on a smaller task-specific or domain-specific dataset. It adapts existing representations rather than learning from scratch, which is why it usually needs far less data and compute than pretraining.
- Base Model: A base model is the pretrained model before instruction tuning, chat alignment, or task-specific fine-tuning. It is usually optimized only for language modeling, so it can complete text well but may not reliably follow user instructions or safety constraints.
- Open-Weight Model: An open-weight model is a model whose trained weights are publicly released for download and local use. That is more specific than 'open source': the weights may be open even when the training data, code, or full recipe are not.
- Sampling (in Language Models): Sampling in language models means selecting the next token from the predicted probability distribution instead of always taking the argmax. The decoding rule strongly shapes diversity, coherence, and repetition, which is why temperature, top-k, and top-p matter so much.
- Greedy Decoding: Greedy decoding always selects the highest-probability next token at each step. It is simple and deterministic, but it often gets trapped in bland or repetitive continuations because it never explores slightly less probable alternatives that might lead to better sequences.
- Top-k Sampling: Top-k sampling truncates the next-token distribution to the \( k \) most probable tokens, renormalizes, and samples from that set. It removes the low-probability tail that often contains junk while still allowing controlled randomness.
- Top-p Sampling (Nucleus Sampling): Top-p, or nucleus, sampling chooses the smallest set of tokens whose cumulative probability exceeds a threshold \( p \), then samples from that adaptive set. Unlike top-k, it expands when the model is uncertain and shrinks when the distribution is sharp.
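The nucleus-truncation step can be sketched as follows: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, and renormalize (the actual sampling draw from the returned distribution is omitted):

```python
def top_p_filter(probs, p):
    # probs: next-token probabilities indexed by token id.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break  # smallest set whose cumulative probability reaches p
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

On a sharp distribution the kept set collapses to one or two tokens; on a flat one it grows, which is the adaptive behavior that distinguishes top-p from a fixed top-k cutoff.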
- Frequency Penalty: A frequency penalty subtracts an amount proportional to how many times a token has already appeared, lowering its future logit more with each repetition. It encourages lexical diversity without banning reuse entirely, which makes it gentler than hard repetition constraints.
- Presence Penalty: A presence penalty lowers the score of any token that has already appeared, encouraging the model to introduce new words or topics instead of repeating earlier ones. Unlike a frequency penalty, it depends only on whether the token has appeared at least once, not on how many times it appeared.
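The contrast between the two penalties is clearest in code. This sketch (illustrative, not any particular API's exact formula) adjusts logits before softmax: the frequency term scales with repetition count, the presence term is a flat one-time cost:

```python
from collections import Counter

def apply_penalties(logits, generated, freq_penalty=0.0, presence_penalty=0.0):
    # generated: token ids produced so far; logits: scores for the next step.
    counts = Counter(generated)
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= freq_penalty * n   # grows with each repetition
        out[tok] -= presence_penalty   # applied once if the token appeared at all
    return out
```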
- Hallucination: Hallucination is when a model produces content that is unsupported or false while presenting it as if it were correct. In language models it often comes from next-token training, weak grounding, or overconfident decoding rather than deliberate deception.
- Retrieval-Augmented Generation (RAG): Retrieval-augmented generation adds a retrieval step so the model conditions on external documents at inference time instead of relying only on memorized parameters. It can improve freshness and grounding, but answer quality depends heavily on retrieval recall, ranking, chunking, and how well the model uses the retrieved evidence.
- Semantic Search: Semantic search retrieves results by meaning rather than exact keyword overlap, usually by embedding queries and documents into a vector space and comparing similarity. It handles paraphrases well, but it is often combined with lexical search when exact terms or identifiers matter.
- Preference-Based Alignment: Preference-based alignment trains models from judgments such as ‘response A is better than response B’ instead of only from supervised targets. It is useful when desired behavior is easier for humans to compare than to specify as a single correct answer.
- Reward Model: A reward model predicts a scalar preference score for a candidate response, usually from pairwise human comparisons. In RLHF it acts as a learned proxy objective, so the policy can exploit its mistakes if optimization pushes too hard against it.
- Self-Critique: Self-critique is a prompting or training pattern where a model reviews its own draft, identifies problems, and then revises the answer. It can improve reasoning and safety, but only when the model can recognize errors more reliably than it makes them.
- Constitutional AI: Constitutional AI aligns a model using an explicit list of principles that guide critique and revision, reducing the need for dense human feedback on every example. The constitution acts like a rule set for self-improvement, though the resulting behavior still depends on the chosen principles and training procedure.
- Direct Preference Optimization (DPO): DPO learns directly from preference pairs by making chosen responses more likely than rejected ones without running a separate RL loop. It can be derived from a KL-constrained reward-maximization view, which is why it is often presented as a simpler alternative to PPO-based RLHF.
- Ranking: Ranking is the task of ordering items by relevance, preference, or utility rather than predicting a single class label. It appears in search, recommendation, and alignment because the main question is which outputs should be placed above others.
- Program-Aided Language Model: A program-aided language model uses the LLM to translate a problem into executable code, then lets an interpreter carry out the exact computation. This separates natural-language understanding from symbolic execution and often improves arithmetic or algorithmic reasoning over pure chain-of-thought.
- Subword Tokenization: Subword tokenization splits text into frequent pieces smaller than words but larger than individual characters. It handles rare words and open vocabularies well by composing unfamiliar words from known subword units.
- Special Tokens: Special tokens are reserved tokens with structural or control meaning, such as BOS, EOS, PAD, SEP, or mask tokens, rather than ordinary text content. They shape formatting, training objectives, and sometimes model behavior.
- Padding Token: A padding token is a dummy token added so sequences in a batch have equal length. It should be ignored by the loss and usually masked from attention so it does not behave like real context.
- BOS Token (Beginning of Sequence): A BOS token marks the beginning of a sequence and gives the model a consistent start symbol for conditioning generation or encoding. It can help define sequence boundaries and sometimes carries special training semantics.
- EOS Token (End of Sequence): An EOS token marks the end of a sequence and tells the model where generation should stop. During training it teaches sequence termination, and during inference it is one of the main stopping conditions.
- Sequence-to-Sequence (Seq2Seq): Sequence-to-sequence learning maps one sequence to another, often with different lengths, such as translation or summarization. Modern seq2seq models are usually encoder-decoder Transformers, though earlier versions used recurrent networks with attention.
- Embedding Space: An embedding space is the vector space produced by an embedding model, where tokens, sentences, images, or other objects are mapped to dense numerical representations. Similarity in that space is used for retrieval, clustering, and transfer, though the geometry depends on the training objective.
- Vector Database: A vector database is a system optimized for storing embeddings and retrieving nearest neighbors together with metadata filtering, updates, and persistence. It is the common serving layer behind semantic search and many RAG systems.
- BM25: BM25 is a sparse retrieval scoring function that ranks documents using term matches weighted by inverse document frequency and document-length normalization. It remains strong for exact lexical search and is often combined with dense retrieval in hybrid systems.
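A minimal Okapi BM25 scorer over tokenized documents. The idf variant and default parameters (k1 = 1.5, b = 0.75) vary between implementations; this sketch uses one common smoothed idf form:

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    # query, doc: token lists; docs: the whole corpus of token lists.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Saturating tf with document-length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

The tf term saturates as counts grow, and the length normalization discounts long documents, which together distinguish BM25 from raw TF-IDF scoring.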
- Sparse Retrieval: Sparse retrieval represents queries and documents with sparse term-based features such as inverted indexes, TF-IDF, or BM25. It excels at exact keywords and rare identifiers, but is weaker than dense retrieval on paraphrases and semantic matching.
- Dense Retrieval: Dense retrieval represents queries and documents with learned dense embeddings and retrieves by vector similarity. It handles paraphrase and semantic matching better than sparse retrieval, but it can miss exact lexical constraints and usually relies on approximate nearest neighbor search.
- Grounding: Grounding means tying a model’s answer to external evidence, inputs, or world state rather than letting it generate from unsupported priors alone. In RAG or tool-use systems, grounding is what makes outputs traceable to retrieved context or observations.
- Factuality: Factuality is whether the content of an answer is actually true in the world or according to trusted sources. An answer can be fluent and even faithful to its source while still being nonfactual if the source itself is wrong or outdated.
- Faithfulness: Faithfulness is whether a model’s output is supported by the provided input, source document, or chain of evidence. It differs from factuality because a summary can be perfectly faithful to a source that contains false claims.
- Tokenization Pipeline: A tokenization pipeline is the full process that turns raw text into model-ready inputs, including normalization, pre-tokenization, subword splitting, token-to-ID mapping, truncation, padding, and special-token insertion. Choices here directly affect sequence length, vocabulary coverage, and downstream behavior.
- Instruction Tuning: Instruction tuning is supervised fine-tuning on instruction-response examples so a pretrained model learns to follow requests instead of merely continuing text. It improves task generality and usability, but it mainly changes behavior and format-following rather than adding much new world knowledge.
- Safety Alignment: Safety alignment is the process of making a model reliably avoid harmful, deceptive, or policy-violating behavior while remaining useful. In practice it combines data curation, supervised tuning, preference optimization or RLHF, classifiers, and adversarial evaluation, but it never guarantees perfect safety.
- Red-Teaming: Red-teaming is adversarial evaluation in which people or automated systems deliberately try to break a model’s safeguards and expose failure modes. Its purpose is not to improve benchmark scores directly, but to find unsafe or brittle behavior before deployment.
- Jailbreak (LLMs): A jailbreak is a prompt or interaction pattern that bypasses an LLM’s safety training or policy enforcement and elicits behavior it was supposed to refuse. Jailbreaks matter because they reveal that aligned behavior can be a thin behavioral layer rather than a deep guarantee.
- Adversarial PromptingAdversarial prompting is the deliberate construction of inputs that push a model toward incorrect, unsafe, or unintended behavior. It includes jailbreaks, prompt injection, data exfiltration attempts, and other attacks that exploit weaknesses in instruction-following or context handling.
- Human EvaluationHuman evaluation uses people to judge outputs on qualities such as helpfulness, factuality, coherence, or safety that automated metrics often miss. It is usually the most trustworthy evaluation for subjective tasks, but it is expensive, slow, and sensitive to rubric design and annotator variance.
- Automated EvaluationAutomated evaluation scores model outputs with metrics or model-based judges instead of human raters. It is fast, scalable, and reproducible, but its usefulness depends on how well the metric correlates with the human judgment that actually matters.
- BLEUBLEU is a machine-translation metric based mainly on n-gram precision against one or more reference texts, combined with a brevity penalty. It is useful for corpus-level comparison, but it often misses meaning-preserving paraphrases and is weak as a sentence-level quality measure.
- Pretraining CorpusA pretraining corpus is the large unlabeled dataset used to train a model’s base capabilities through self-supervised objectives such as next-token prediction. Its size, quality, duplication rate, domain mix, and filtering choices strongly shape what the model knows and how it behaves.
- Instruction DatasetAn instruction dataset is a curated set of prompts paired with preferred responses used to teach a pretrained model how to behave as an assistant. It mainly teaches task framing, format, and interaction style rather than the broad world model that comes from pretraining.
- Preference DatasetA preference dataset contains prompts with ranked, paired, or binary-labeled responses indicating which outputs are preferred. It is the standard supervision source for reward modeling and direct preference objectives because it expresses comparative quality better than a single gold answer.
- Probing (Neural Networks)Probing tests whether information is encoded in a model’s hidden states by training a simple classifier or regressor on those representations. A successful probe shows that the information is recoverable, but not necessarily that the model causally uses it.
- Zero-Shot LearningZero-shot learning is the ability to perform a task from a description alone, without task-specific training examples in the prompt or fine-tuning data. In LLMs it is a direct consequence of broad pretraining and instruction-following ability.
- Beam SearchBeam search is a decoding algorithm that keeps the top-scoring partial sequences at each step instead of only the single best one. It approximates high-probability generation better than greedy decoding, but it can still miss the global optimum and often reduces diversity.
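A minimal beam search over a toy model makes the greedy-versus-beam trade-off concrete. The table-backed `step_logprobs` model and its probabilities are invented for illustration:

```python
import math

def beam_search(step_logprobs, beam_width, max_len, bos="<s>", eos="</s>"):
    """Keep the beam_width highest-scoring partial sequences each step.
    step_logprobs(seq) returns {token: log-prob} continuations."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:          # finished beams carry forward
                candidates.append((seq, score))
                continue
            for tok, lp in step_logprobs(tuple(seq)).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Toy model where greedy decoding is suboptimal: "a" looks best first,
# but every continuation of "a" is worse than committing to "b".
table = {
    ("<s>",):          {"a": math.log(0.6), "b": math.log(0.4)},
    ("<s>", "a"):      {"</s>": math.log(0.5), "c": math.log(0.5)},
    ("<s>", "b"):      {"</s>": math.log(1.0)},
    ("<s>", "a", "c"): {"</s>": math.log(1.0)},
}
model = lambda seq: table[seq]
```

With `beam_width=2` the search recovers the globally best sequence `<s> b </s>` (probability 0.4), while `beam_width=1` (greedy) commits to `a` and ends with probability 0.3.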
- Repetition PenaltyA repetition penalty is a decoding heuristic that downweights tokens or phrases the model has already used, reducing loops and bland repetition. It improves generation quality when a model is prone to degeneracy, but too much penalty can make text unnatural or incoherent.
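A common multiplicative form of the penalty (popularised by CTRL and used in several inference libraries) is easy to sketch; the function name and the penalty value here are illustrative:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Downweight tokens that were already generated: divide positive
    logits by the penalty, multiply negative logits by it (both moves
    reduce the token's post-softmax probability)."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

A penalty of 1.0 is a no-op; values much above ~1.5 tend to produce the unnatural text the entry warns about.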
- Neural Language ModelA neural language model predicts text with learned distributed representations and a neural network rather than count tables. Its main advantage over classical n-gram models is that it can generalize to unseen contexts by sharing statistical strength across similar words and patterns.
- Semantic SpaceA semantic space is an embedding space in which geometric relations reflect meaning, similarity, or functional role. Nearby points tend to correspond to semantically related items, which is why vector search and representation learning work at all.
- Information RetrievalInformation retrieval is the problem of finding and ranking the documents, passages, or items most relevant to a query. Modern systems combine lexical matching, learned embeddings, and ranking models because exact term overlap and semantic similarity each capture different kinds of relevance.
- Domain-Specific PretrainingDomain-specific pretraining continues or repeats pretraining on corpus data from a specialized domain such as law, medicine, or code. It improves vocabulary use, factual recall, and style in that domain, but it can also narrow the model or erode performance outside the target distribution.
- Conversational AIConversational AI is a class of systems designed for multi-turn interaction, where the model must respond helpfully while tracking context, intent, and dialogue state. The hard part is not generating one good answer, but remaining coherent and useful across an extended interaction.
- Role-Playing (LLMs)Role-playing in LLMs means conditioning a model to adopt a persona, voice, or behavioral frame during generation. It is useful for simulation and product design, but it also shows how easily high-level behavior can be steered by prompt context.
- Data MixtureA data mixture is the weighting and composition of different datasets or domains in a training run. It matters because capability, robustness, and bias often depend as much on what proportion of the data comes from each source as on the total token count.
- Pretraining Data DeduplicationPretraining data deduplication removes near-duplicate documents or passages from a training corpus. It improves per-token efficiency and reduces memorization and benchmark contamination, because repeatedly seeing the same text usually wastes compute more than it adds knowledge.
- SigLIPA contrastive image–text pretraining method (Zhai et al., 2023) that replaces CLIP's softmax-over-batch contrastive loss with a pairwise sigmoid binary cross-entropy. SigLIP removes the need for large global batches, remains effective across a wide range of batch sizes, and achieves CLIP-level or better zero-shot accuracy at a fraction of the training compute.
- Vision-Language Contrastive Objectives Beyond CLIPSuccessors to CLIP refine its symmetric InfoNCE loss for better efficiency, finer-grained alignment, and scaling. SigLIP replaces softmax with a pairwise sigmoid; LiT freezes a pretrained image tower; ALIGN scales to roughly a billion noisy alt-text pairs; FILIP contrasts at the token level; CoCa adds a captioning head. All share the joint-embedding template.
- Whisper (Speech-to-Text)OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, voice-activity detection, and language identification from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- Transformer-XL / Segment-Level RecurrenceDai et al. (2019) extend Transformers beyond fixed context by caching hidden states of the previous segment and allowing attention to read from them — a simple "segment-level recurrence" that gives an effective receptive field of \( N \cdot L \) for \( L \) layers and segment length \( N \). Paired with relative positional encoding, it was a key bridge between pure attention and long-context models.
- Longformer / BigBird (Sparse Long-Context Attention)Fixed sparsity patterns that reduce attention from \( O(n^2) \) to \( O(n) \) for long documents. Longformer combines sliding-window + global attention; BigBird adds random attention and proves the result retains full-attention universal-approximation properties. Both were pre-2022 answers to scaling Transformers to 4k–16k tokens.
- RWKVAn RNN-Transformer hybrid (Peng et al., 2023) whose block is a parallelisable linear-attention operation at training time and a simple recurrent state update at inference time. RWKV scales to 14B+ parameters with Transformer-competitive perplexity, offering constant-memory inference.
- Rejection Sampling Fine-Tuning (RFT)An offline alignment method: sample many completions from a base model, keep only those that pass a quality / reward filter, and fine-tune on the survivors. RFT is the simplest way to convert a reward model (or a verifier) into improved policy behaviour, and serves as a strong baseline against DPO / PPO.
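The RFT data-construction step fits in a few lines; `sample` and `reward` below are hypothetical stand-ins for the base policy and the reward model or verifier:

```python
def rft_dataset(prompts, sample, reward, threshold, k=8):
    """Rejection-sampling fine-tuning data: draw k completions per
    prompt, keep only (prompt, completion) pairs whose reward clears
    the threshold; survivors become ordinary SFT targets."""
    kept = []
    for p in prompts:
        for _ in range(k):
            y = sample(p)
            if reward(p, y) >= threshold:
                kept.append((p, y))
    return kept
```

The subsequent fine-tuning run is standard supervised training on `kept`, which is what makes RFT such a simple baseline.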
- Self-Play Fine-Tuning (SPIN)Chen et al. (2024) cast fine-tuning as a game between the current model and its previous iterate: the new model must distinguish its own generations from demonstrations, improving it without any new human data. SPIN delivers substantial improvements on supervised data alone, bootstrapping from a weak SFT model toward a stronger one.
- Iterative DPOIterative DPO repeats a simple loop: sample responses from the current policy, score or label them, apply a DPO-style update, and repeat. It brings online data collection to preference optimization while keeping DPO's simpler training dynamics compared with PPO-based RLHF.
- Best-of-N Sampling and its ScalingAn inference-time boost: sample \( N \) responses from a language model, score them with a reward model or verifier, and return the best. Quality typically improves roughly with \( \log N \); scaling analyses identify when spending compute on best-of-\( N \) inference beats spending it on extra training, and motivate distilling best-of-\( N \) behaviour back into the policy via RFT.
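The sample-score-select loop is trivial to sketch; `sample` and `reward` below are toy stand-ins for the policy and the verifier:

```python
import random

def best_of_n(sample, reward, n, seed=0):
    """Best-of-N: draw n candidates and return the highest-reward one.
    sample(rng) plays the policy; reward plays the verifier."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy task: the 'policy' guesses integers, the 'verifier' prefers 7.
sample = lambda rng: rng.randint(0, 9)
reward = lambda x: -abs(x - 7)
```

With a fixed seed, the best-of-1 candidate is the first of the best-of-50 candidates, so the larger \( N \) can never score worse under the same reward.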
- Weak-to-Strong GeneralizationOpenAI's empirical testbed for scalable oversight (Burns et al., 2023): can a strong model, fine-tuned on labels from a weaker supervisor, generalise beyond the supervisor's capability? Experiments on NLP and chess tasks show it partially can; the residual quality gap motivates future work on supervising superhuman models.
- Tool Use / Function Calling Benchmarks (BFCL, τ-bench)BFCL (Berkeley Function Calling Leaderboard, 2024) and τ-bench (2024) evaluate LLMs' ability to select, parameterise, and sequence API calls. BFCL is single-turn function-call accuracy; τ-bench is multi-turn dialogue in simulated customer-service environments with realistic state and policy constraints.
- Agent Benchmarks (SWE-bench, GAIA, WebArena)SWE-bench tests whether a model can fix real GitHub issues in code repositories, GAIA tests general tool-using problem solving with automatically checked answers, and WebArena tests web-navigation agents in simulated sites. Together they measure software, reasoning, and browser-action competence rather than just one-shot text generation.
- BERT: Bidirectional Encoder PretrainingBERT (Devlin et al., 2019) is a Transformer encoder pretrained with masked language modelling (MLM) and next-sentence prediction (NSP), producing bidirectional contextual representations. Unlike GPT's causal, left-to-right pretraining, MLM sees both past and future tokens, making BERT the go-to encoder for classification, span extraction, and sentence embeddings. Base/Large variants (110M/340M params) dominated GLUE/SQuAD until decoder-only LLMs took over.
- WordPiece, Unigram, and SentencePiece TokenisationWordPiece, Unigram, and SentencePiece are subword tokenization schemes used to balance vocabulary size against sequence length. WordPiece builds a vocabulary greedily, Unigram prunes a probabilistic vocabulary by likelihood, and SentencePiece is the language-agnostic toolkit that trains and applies these methods on raw text.
- Prefix-Tuning and Prompt-TuningPrompt-tuning and prefix-tuning freeze the base model and learn only a small set of continuous prompt parameters. Prompt-tuning adds learned embeddings at the input, while prefix-tuning adds learned key-value prefixes inside each attention layer.
- Adapter Layers (Houlsby-style)Houlsby-style adapters are small bottleneck MLPs inserted inside each Transformer block while the original backbone is frozen. They were the first clean PEFT recipe: add a few trainable parameters per layer, keep the base model unchanged, and specialize the model to new tasks cheaply.
- LLaVA: Visual Instruction TuningLLaVA (Liu et al., 2023) connects a frozen CLIP vision encoder to a frozen LLM via a small learned projection, then instruction-tunes the combined model on GPT-4-generated multimodal dialogues. The recipe is minimal — a single linear (later two-layer MLP) projector — yet competitive with closed VLMs, establishing the canonical open VLM pattern: (pretrained vision encoder) + (learned bridge) + (pretrained LLM) + (visual instruction data).
- ColBERT and Late-Interaction RetrievalColBERT (Khattab & Zaharia, 2020) represents each document as a bag of contextualised token vectors rather than a single vector, and scores against a query via MaxSim: for each query token, find its best-matching document token and sum the similarities. Late interaction preserves token-level granularity that pooling destroys, closing the quality gap between dense retrievers and cross-encoders at a fraction of the cost.
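The MaxSim operator itself is a one-liner over token embeddings; a dependency-free sketch (real ColBERT runs this over BERT token vectors with an inverted index for candidate generation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token takes its best cosine match
    among the document's token vectors; the per-token maxima are summed."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)
```

Because the maximum is taken per query token, a document only needs one good match per query term, which is exactly the granularity single-vector pooling throws away.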
- Conditional Random Fields (CRF)A discriminative model of \( p(y \mid x) = \tfrac{1}{Z(x)} \exp \sum_k \lambda_k f_k(y, x) \) over structured outputs. Linear-chain CRFs add sequence-level constraints on top of per-token scores, enabling tractable training via forward–backward and Viterbi decoding — still the backbone of NER/tagging heads above neural encoders.
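Viterbi decoding for a linear chain can be sketched with plain dicts; this illustrative version takes precomputed per-position emission scores and a complete transition table (in a neural tagger the emissions come from the encoder):

```python
def viterbi(emissions, transitions):
    """Max-scoring tag path for a linear-chain model. emissions is a
    list of per-position {tag: score} dicts; transitions maps every
    (prev_tag, tag) pair to a score."""
    tags = list(emissions[0])
    scores = {t: emissions[0][t] for t in tags}   # best score ending in t
    backptrs = []
    for em in emissions[1:]:
        new_scores, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: scores[p] + transitions[(p, t)])
            new_scores[t] = scores[prev] + transitions[(prev, t)] + em[t]
            ptr[t] = prev
        scores = new_scores
        backptrs.append(ptr)
    last = max(tags, key=lambda t: scores[t])
    path = [last]
    for ptr in reversed(backptrs):                # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Training uses the same dynamic program in sum-product form (forward-backward) to compute \( Z(x) \) and gradients.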
- Noise-Contrastive Estimation (NCE)Learn an unnormalised model \( \tilde p_\theta(x) \) by training a binary classifier to distinguish data samples from noise samples. The classifier's logit becomes \( \log \tilde p_\theta(x) - \log q_{\text{noise}}(x) \), so the partition function is absorbed into a learnable constant. Foundation of word2vec's negative sampling and of InfoNCE contrastive learning.
- Pointer NetworksA seq2seq architecture whose decoder outputs indices into the input sequence via attention weights (a 'pointer') rather than tokens from a fixed vocabulary. Ideal for combinatorial problems whose output vocabulary depends on the input (convex hull, TSP, sorting) and for extractive QA / span prediction.
- Data Curation & Quality Filters (FineWeb, Dolma)Modern pretraining pipelines filter terabytes of web data through language ID, heuristic rules (repetition, punctuation ratios), classifier-based quality scoring, and toxicity / PII removal. The FineWeb and Dolma recipes document which filters mattered — often delivering per-token quality gains equivalent to 2–3× scale-up.
- MinHash & LSH for Large-Scale DeduplicationMinHash approximates Jaccard similarity between sets, and locality-sensitive hashing uses that approximation to quickly find near-duplicates. In ML data pipelines this is used to deduplicate documents or examples at very large scale.
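A small MinHash sketch shows the estimator at work; character 3-gram shingling and 64 seeded hash functions are illustrative choices, and production pipelines bucket the signatures with LSH rather than comparing all pairs:

```python
import hashlib

def shingles(text, k=3):
    """Character k-gram shingle set of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set.
    The fraction of matching signature slots estimates Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The key property is that for one hash function, P(min hashes agree) equals the true Jaccard similarity, so more hash functions just reduce estimator variance.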
- Synthetic Data Generation for Post-TrainingModern instruction-tuning and RL-based alignment rely on LLM-generated synthetic data: self-instruct / Evol-Instruct expand seed prompts, teacher models produce high-quality completions, and process-reward models validate chain-of-thought steps. The backbone of the post-ChatGPT post-training stack.
- Continual Pretraining & Mid-TrainingContinue pretraining an existing base model on a domain or task-focused corpus (code, math, a new language) before final post-training. Often achieves domain gains that would be substantially more expensive to obtain by fine-tuning alone. Sits between pretraining and SFT in modern recipes.
- Self-Rewarding Language ModelsTrain a model to both generate responses and judge them, using the same weights. Each iteration: generate a pool of candidates, self-rate them, extract preference pairs, DPO-train on the pairs. Recursive improvement without external reward data; bottlenecks surface around judgement quality and diversity collapse.
- Long-Context Data Recipes (RULER, Needle Variants)Extending effective context beyond 128k requires (a) RoPE-scaling or position-interpolation to keep positional encodings sane, (b) a continued-pretraining dataset with real long documents and synthetic stitched tasks, and (c) evaluation beyond simple needle-in-a-haystack — RULER adds multi-needle, multi-hop, and aggregation subtasks that expose superficial-match shortcuts.
- Tokenizer Training Dynamics & Vocab SizingTokenizer design trades vocabulary size against sequence length and changes both compute cost and what patterns the model can represent cleanly. BPE, unigram, and byte-level schemes make different compromises, especially for code, multilingual text, and rare domain terms.
- ORPO (Odds-Ratio Preference Optimization)A reference-model-free preference-optimisation objective that combines SFT and preference learning in one loss: \( \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \lambda \log \sigma(\log \text{odds}(y_w) - \log \text{odds}(y_l)) \). Eliminates DPO's reference-model requirement, halving training memory.
- SimPO (Simple Preference Optimization)Reference-free preference objective that replaces DPO's reference ratio with a length-normalised policy log-likelihood: \( r(x, y) = \log \pi_\theta(y\mid x) / |y| \). Adds a margin \( \gamma \) to the preferred response. Simpler than DPO, matches or beats it, and length-normalisation reduces verbosity exploitation.
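The SimPO loss is compact enough to write out on precomputed sequence log-probabilities; the \( \beta \) and \( \gamma \) values here are illustrative, not the paper's tuned settings:

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO (sketch): the implicit reward is the length-normalised
    policy log-likelihood; the loss is -log sigmoid of the scaled
    reward gap minus a target margin gamma."""
    r_w = logp_w / len_w          # reward of the preferred response
    r_l = logp_l / len_l          # reward of the rejected response
    margin = beta * (r_w - r_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Dividing by length is what removes the incentive to inflate reward by simply generating longer preferred responses.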
- IPO (Identity Preference Optimization)Replaces DPO's sigmoid objective with a squared-error criterion on preference probabilities: \( \mathcal{L}_{\text{IPO}} = \mathbb{E}[(h_\theta(y_w, y_l) - 1/(2\beta))^2] \). Prevents DPO's tendency to over-separate preferred and rejected responses on easy pairs, reducing overfitting and improving generalisation.
- RLVR: RL from Verifiable RewardsTrain a reasoning policy with pure RL signals from tasks whose answers are automatically verifiable — math (exact match), code (unit-test execution), proof (checker), chess (engine eval). No preference model needed. The method behind DeepSeek-R1-Zero and the o1-style long-CoT reasoning families.
- Reward Hacking & Specification Gaming in RLHFWhen a learned reward model is a proxy for human preference, RL optimisation finds adversarial inputs that maximise reward without matching the true objective — verbose apologies, sycophancy, confident wrong answers, format exploitation. Goodhart's law in practice. Mitigations range from KL penalty and reward normalisation to process supervision and debate.
- Refusal Training & Harmlessness ObjectivesTeach a model to decline unsafe or out-of-policy queries while remaining helpful on benign ones. Typical recipe: SFT on curated refusal examples, preference optimisation with harm-labelled pairs, red-team-driven iteration. Badly done, this causes over-refusal (the 'brittle helpfulness' regime); done well, it achieves high refusal rate with minimal utility loss.
- Graph of ThoughtsGeneralises chain-of-thought and tree-of-thought to an arbitrary DAG of reasoning nodes, each produced by a prompt. Enables aggregation (merge parallel solutions), refinement loops, and reuse of intermediate results. Useful for problems where branch-and-combine dominates (sorting, constraint satisfaction, multi-step planning).
- Toolformer & Learned Tool UseToolformer trains a language model to decide when to call external tools and what arguments to send without requiring human demonstrations of tool use. It keeps tool calls only when the returned result improves the continuation, making tool use a self-supervised learning signal.
- Plan-and-Solve PromptingTwo-stage prompt pattern: first ask the model to produce a plan (step-by-step outline), then execute the plan step by step. Improves over naive zero-shot CoT on arithmetic and multi-hop reasoning by forcing explicit decomposition. Contrasts with ReAct's interleaved thought-action loop.
- Constrained Decoding: Grammars, JSON Mode, RegexGrammar-constrained decoding masks any next token that would violate a target grammar or schema. This guarantees outputs such as valid JSON, XML, or regex-matching strings by restricting generation to the language accepted by a finite-state machine or pushdown automaton.
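The masking idea can be shown with a tiny hand-written DFA; this toy grammar (signed integers, `-?[0-9]+`) and vocabulary are invented for illustration, while real systems compile a JSON schema or regex into the automaton:

```python
# DFA for the toy grammar of signed integers: -?[0-9]+
DFA = {
    ("start", "-"): "digits0",
    **{("start", d): "digits" for d in "0123456789"},
    **{("digits0", d): "digits" for d in "0123456789"},
    **{("digits", d): "digits" for d in "0123456789"},
}
ACCEPTING = {"digits"}
VOCAB = list("-0123456789abc")

def allowed_tokens(state):
    """The mask: only tokens with a DFA transition from state survive."""
    return [t for t in VOCAB if (state, t) in DFA]

def constrained_generate(pick, max_len=5):
    """Constrained generation: pick(options) chooses among allowed
    tokens (standing in for sampling from masked logits); any token
    that would leave the automaton is never offered."""
    state, out = "start", []
    for _ in range(max_len):
        options = allowed_tokens(state)
        if not options:
            break
        tok = pick(options)
        out.append(tok)
        state = DFA[(state, tok)]
    return "".join(out), state in ACCEPTING
```

The same pattern generalises: JSON mode compiles the schema into a (pushdown) automaton and applies exactly this mask at every decoding step.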
- Model Context Protocol (MCP)Model Context Protocol is an open client-server protocol for connecting models and agents to external tools, resources, and prompts through a common interface. It standardizes capability discovery and tool invocation so one MCP-aware client can talk to many different servers without bespoke integrations.
- Agentic Workflows & Multi-Agent OrchestrationSystems that compose multiple LLM calls — planner, executor, critic, tool-user — into an end-to-end workflow. Patterns range from ReAct loops to fixed DAGs (LangGraph) to autonomous looping agents (AutoGPT, BabyAGI). Success requires careful handoff design, termination criteria, and cost control.
- LLM Watermarking (Kirchenbauer et al.)Embed a statistical signature into generated text that is invisible to humans but detectable by an algorithm with the watermarking secret. Kirchenbauer et al. (2023) partition the vocabulary into a pseudo-random green / red list per step, biasing generation toward green; later detection uses a \( z \)-test on green-token frequency.
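The green-list partition and the detection z-test can be sketched end to end; the seeding scheme and parameters below are illustrative simplifications of the paper's construction:

```python
import math
import random

def green_list(prev_token, vocab_size, green_fraction=0.5, key="secret"):
    """Pseudo-random green/red vocabulary split, seeded by the previous
    token and the watermarking key (sketch of the Kirchenbauer scheme)."""
    rng = random.Random(f"{key}:{prev_token}")
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(green_fraction * vocab_size)])

def watermark_z(tokens, vocab_size, green_fraction=0.5, key="secret"):
    """z-test on the green-token count: under the null hypothesis of
    unwatermarked text, hits ~ Binomial(n, green_fraction)."""
    hits = sum(tok in green_list(prev, vocab_size, green_fraction, key)
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    mean = green_fraction * n
    std = math.sqrt(green_fraction * (1 - green_fraction) * n)
    return (hits - mean) / std
```

A generator that always (or preferentially) emits green tokens pushes the z-score far above any plausible threshold, while ordinary text stays near zero.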
- Prompt Injection: Taxonomy & DefencesAdversarial instructions embedded in model-accessible content — tool outputs, retrieved documents, emails — that override the user's original task. Direct (in user prompt) vs indirect (in external content). Defences include input filtering, dual-model separation, and structured prompt templates; none is a complete solution.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any)Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- Text-to-Image AlignmentText-to-image (T2I) alignment is the task of making generated images faithfully follow textual prompts — covering spatial layout, attribute binding, count, and style. Modern alignment relies on strong text conditioning (T5 text encoders or contrastive image–text encoders such as CLIP and SigLIP) injected via cross-attention into a diffusion or flow backbone, plus classifier-free guidance, RLHF-style preference fine-tuning, and reward models that grade prompt adherence.
- RLHF as KL-Regularized Policy OptimizationA deeper theoretical view of RLHF treats post-training as optimizing a policy against a learned reward while regularizing toward a reference model with a KL penalty. This viewpoint explains why PPO-RLHF, reward-model training, and even DPO-style objectives are closely related: they are different ways of solving or approximating the same regularized preference-optimization problem.
- Latent Dirichlet Allocation (LDA Topic Models)Latent Dirichlet Allocation models each document as a mixture of latent topics and each topic as a distribution over words. It is a generative model for uncovering coarse semantic structure in bag-of-words corpora, not a modern contextual language model.
- Bahdanau AttentionBahdanau attention is the original additive attention mechanism for sequence-to-sequence models, where the decoder scores each encoder state before producing the next token. It solved the fixed-context bottleneck of early seq2seq RNNs by letting the decoder look back over the whole source sequence at every step.
- Seq2Seq with AttentionSeq2seq with attention augments the encoder-decoder architecture so the decoder conditions on a context vector built from all encoder states at each output step. That change made neural machine translation far more effective than fixed-context seq2seq and directly paved the way to modern cross-attention and Transformer models.