Tag: architecture
125 topic(s)
- **Multimodal Secure Alignment**: Multimodal secure alignment is the problem of making a model's safety behavior consistent across text, images, audio, and mixed-modal inputs. It matters because harmful intent can be split across modalities or hidden in images that evade text-only filters, so defenses must align the fused system rather than just one input channel.
- **Constitutional Classifiers++**: Constitutional Classifiers++ is a production-oriented jailbreak defense that uses context-aware classifiers and a cascade of cheap and expensive checks to block harmful exchanges efficiently. The system is designed to keep refusal rates and serving cost low while still catching universal jailbreaks that earlier, response-only filters missed.
- **Continuous Thought Machines (CTM)**: Continuous Thought Machines are models that make neural timing and synchronization part of the representation, instead of treating layers as purely instantaneous mappings. They use neuron-level temporal processing and support adaptive compute, so the same model can stop early on easy inputs or continue reasoning on harder ones.
- **T5 (Text-to-Text Transfer Transformer)**: T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- **GPT-2 & Zero-Shot Task Transfer**: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- **GPT-1 (Generative Pre-Training)**: GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- **ELMo (Embeddings from Language Models)**: ELMo produces contextualized word embeddings by taking a learned task-specific combination of hidden states from a pretrained bidirectional LSTM language model. Unlike static embeddings such as word2vec or GloVe, it gives the same word different vectors in different sentence contexts.
- **Sparsely-Gated Mixture of Experts (MoE)**: A sparsely-gated Mixture of Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic.
- **Key-Value Memory Networks**: Key-Value Memory Networks store each memory slot as a key for retrieval and a separate value for the returned content. This decouples matching from payload and is a direct conceptual precursor to modern query-key-value attention.
- **Neural Turing Machine (NTM)**: A Neural Turing Machine augments a neural controller with a differentiable external memory that it can read from and write to using soft attention over memory locations. It was an early attempt to learn algorithm-like behavior such as copying and sorting while remaining trainable end to end.
- **PagedAttention**: PagedAttention stores the KV cache in fixed-size non-contiguous blocks, like virtual-memory pages, instead of requiring one contiguous allocation per sequence. This largely removes fragmentation, enables prompt-prefix sharing, and is a key reason vLLM can serve many more concurrent requests.
- **Speculative Decoding**: Speculative decoding speeds up autoregressive generation by letting a small draft model propose several tokens and then having the large target model verify them in parallel. With the rejection-sampling correction from the original algorithm, the output distribution remains exactly the same as sampling from the target model alone.
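A minimal NumPy sketch of that accept/reject rule on toy next-token distributions; the function name and the fixed drafted token are illustrative, not taken from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, drafted_token):
    """Accept or reject one drafted token so outputs follow the target distribution.

    p_target, q_draft: next-token distributions from the large and small model.
    Accept with probability min(1, p/q); on rejection, resample from the
    normalized positive part of (p_target - q_draft).
    """
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if rng.random() < min(1.0, p / q):
        return drafted_token, True
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

# Toy vocabulary of 4 tokens: the draft model proposed token 2, the target verifies it.
p_target = np.array([0.1, 0.2, 0.6, 0.1])
q_draft = np.array([0.1, 0.1, 0.3, 0.5])
token, accepted = speculative_step(p_target, q_draft, drafted_token=2)
print(token, accepted)
```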
- **Autoencoders**: Autoencoders are neural networks trained to reconstruct their inputs after passing them through a compressed or otherwise constrained latent representation. They are useful because the bottleneck forces the model to learn structure in the data rather than just memorize an identity map.
- **Bootstrap Aggregating (Bagging)**: Bootstrap Aggregating trains multiple models on bootstrap-resampled versions of the training set and averages their predictions to reduce variance. It helps most with unstable base learners such as decision trees, which is why it underlies random forests.
- **Gradient Boosting Machines (GBM)**: Gradient Boosting Machines build an additive model by fitting each new weak learner to the negative gradient of the current loss. In practice each stage focuses on correcting the remaining errors of the ensemble, which makes boosting powerful but sensitive to overfitting if trees and learning rates are not controlled.
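A short scikit-learn comparison of the two ensembles on synthetic data; all hyperparameters here are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: reduce variance by averaging trees fit on bootstrap resamples.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
# Boosting: each stage fits the negative gradient of the loss; a small
# learning rate shrinks each stage's contribution to control overfitting.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)

for name, model in [("bagging", bag), ("boosting", gbm)]:
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```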
- **Affine transformation**: An affine transformation is a linear map followed by a translation, so it has weights and a bias. Dense neural network layers are affine rather than strictly linear, because the bias lets the model shift activations and decision boundaries.
- **Forward pass**: The forward pass is the computation that maps input data through the model to produce activations and an output prediction. During training it also caches intermediate values needed later by the backward pass.
- **Neural network**: A neural network is a parameterized function built by composing affine transformations with nonlinear activations across layers. Its power comes from learning representations from data rather than relying on hand-crafted features for each task.
- **Neuron**: A neuron in a neural network computes a weighted sum of its inputs, adds a bias, and applies an activation function. Collections of neurons form layers, so a single neuron's role is simple even though many together can represent complex functions.
- **Activation function**: An activation function is the nonlinear mapping applied after an affine transformation in a neural network. It is what prevents a stack of layers from collapsing into one affine map, enabling deep networks to approximate complex functions.
- **Sigmoid**: The sigmoid function maps a real number to a value between zero and one, making it easy to interpret as a probability or gate. Its downside is saturation at large positive or negative inputs, which can cause vanishing gradients in deep networks.
- **Tanh**: The tanh function maps inputs to the range minus one to one and is zero-centered, which often makes optimization easier than with the sigmoid. Like the sigmoid, however, it still saturates at large magnitudes and can cause vanishing gradients.
- **ReLU**: ReLU outputs the positive part of its input, max(0, x). It became the default activation in many deep networks because it is simple, cheap, and far less prone to saturation than sigmoid or tanh, though units can still die if they stay on the zero side.
- **Softmax**: Softmax turns a vector of logits into a probability distribution by exponentiating and normalizing them so the components sum to one. It is commonly used for multiclass prediction because it converts arbitrary scores into class probabilities while preserving their ranking.
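The four activations above in a few lines of NumPy, including the standard max-subtraction trick that keeps softmax numerically stable.

```python
import numpy as np

def sigmoid(x):            # squashes to (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # zero-centered squashing to (-1, 1)
    return np.tanh(x)

def relu(x):               # positive part; non-saturating for x > 0
    return np.maximum(x, 0.0)

def softmax(z):            # logits -> probability distribution
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x), sep="\n")
```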
- **Fully connected layer**: A fully connected layer applies an affine transformation in which every output unit depends on every input feature. It is the standard dense layer used in multilayer perceptrons and as a projection block inside many larger architectures.
- **Input layer**: The input layer is the entry point of a network, where raw or preprocessed features are presented to the model. Unlike hidden layers, it usually performs little or no learned computation by itself and mainly defines the representation the rest of the network receives.
- **Hidden layer**: A hidden layer is any internal layer between the input and output of a network. Hidden layers transform raw inputs into increasingly useful intermediate representations that the final output layer can read out.
- **Output layer**: The output layer is the final transformation that maps a model's last hidden representation to a prediction space such as class logits, probabilities, or regression values. Its shape and activation depend on the task being solved.
- **Multilayer perceptron (MLP)**: A multilayer perceptron is a feedforward neural network made of stacked fully connected layers and nonlinear activations. It is the canonical dense architecture for tabular function approximation and the feed-forward subnetwork inside many Transformer blocks.
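A minimal NumPy sketch tying the last several entries together: affine maps, ReLU nonlinearities on the hidden layers, and an output layer of raw logits, composed into one forward pass. Layer sizes and initialization are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(x, W, b):
    """Affine transformation: linear map plus bias, the core of a dense layer."""
    return x @ W + b

def relu(x):
    return np.maximum(x, 0.0)

# A 2-hidden-layer MLP: input(4) -> hidden(8) -> hidden(8) -> output(3 logits).
sizes = [4, 8, 8, 3]
params = [(rng.normal(0, 0.5, (m, n)), np.zeros(n))
          for m, n in zip(sizes, sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(params):
        x = affine(x, W, b)
        if i < len(params) - 1:   # nonlinearity on hidden layers only
            x = relu(x)
    return x                      # output layer: raw class logits

print(forward(rng.normal(size=4)))
```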
- **Deep neural network**: A deep neural network is a neural network with multiple hidden layers rather than just one or two. The extra depth lets it build hierarchical features and represent complex functions more efficiently than shallow networks in many settings.
- **Feedforward neural network**: A feedforward neural network is a network whose computations move from input to output without recurrent cycles. Each layer depends only on earlier activations in the same pass, making feedforward networks the basic template for MLPs and many vision models.
- **Convolutional neural network (CNN)**: A convolutional neural network uses learned convolution filters with local receptive fields and weight sharing to process grid-like data such as images. Those inductive biases make CNNs especially effective and parameter-efficient for visual pattern recognition.
- **Recurrent neural network (RNN)**: A recurrent neural network processes sequences by maintaining a hidden state that is updated one step at a time from the current input and previous state. This gives it a notion of temporal memory, but plain RNNs are hard to train on long dependencies because gradients can vanish or explode.
- **Elman RNN**: An Elman RNN is the classic simple recurrent network in which the next hidden state is a nonlinear function of the current input and previous hidden state. It introduced the basic hidden-state recurrence used by later gated models, but long-range memory is poor without gating.
- **Hidden state**: The hidden state is the internal representation a sequential model carries forward as it processes inputs over time. In an RNN or LSTM it summarizes relevant past context, and in broader neural architectures it usually means a layer's intermediate activation vector.
- **Long short-term memory (LSTM)**: Long short-term memory is a gated recurrent architecture designed to preserve information over long timescales. Its input, forget, and output gates regulate a cell state with near-linear self-connections, which helps prevent the vanishing-gradient behavior of simple RNNs.
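One LSTM cell step in NumPy, showing how the gates regulate the near-linear cell-state path described above; the weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d_in, d_h = 3, 4

# One weight matrix per gate, acting on [x_t, h_prev] concatenated.
W = {g: rng.normal(0, 0.4, (d_in + d_h, d_h)) for g in "ifog"}
b = {g: np.zeros(d_h) for g in "ifog"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(z @ W["i"] + b["i"])      # input gate: what to write
    f = sigmoid(z @ W["f"] + b["f"])      # forget gate: what to keep
    o = sigmoid(z @ W["o"] + b["o"])      # output gate: what to expose
    g = np.tanh(z @ W["g"] + b["g"])      # candidate cell update
    c = f * c_prev + i * g                # near-linear cell-state path
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):    # run over a length-5 sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```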
- **xLSTM**: xLSTM is a family of modern LSTM variants that adds exponential gating and redesigned memory structures, including scalar-memory and matrix-memory forms, to make recurrent models more scalable. The goal is to keep LSTM-style recurrence while improving stability, parallelism, and long-context performance.
- **minLSTM**: minLSTM is a simplified LSTM variant designed to remove some of the sequential dependencies that make classical LSTMs expensive while keeping useful gating behavior. The result is a lighter recurrent block that can be trained more efficiently and scaled more easily.
- **Gated recurrent unit (GRU)**: A gated recurrent unit is a recurrent architecture that uses update and reset gates to control how much past information is kept and how much new input is written into the hidden state. It is simpler than an LSTM because it has no separate cell state, yet it often achieves similar sequence-modeling performance.
- **Embedding layer**: An embedding layer maps discrete IDs such as words, subwords, or items to learned dense vectors. It is essential whenever symbolic inputs must be represented in a continuous space that gradient-based models can manipulate.
- **Autoregressive language model**: An autoregressive language model generates text left-to-right by modeling \( P(w_t \mid w_{<t}) \) for each token. Because it only conditions on past tokens, it can be used directly for open-ended generation as well as scoring sequences.
- **Masked language model**: A masked language model is trained to recover tokens hidden within a sequence using both left and right context. This bidirectional training makes MLMs strong encoders for understanding tasks, but less natural than autoregressive models for direct generation.
- **Causal language model**: A causal language model predicts each token using only earlier tokens, enforced by a causal attention mask. It is essentially the same modeling family as an autoregressive language model, with the word 'causal' emphasizing the masking constraint in self-attention.
- **Perceptron**: The perceptron is a linear threshold classifier that predicts a class from the sign of \( w^\top x + b \) and updates its weights only on mistakes. It is historically important because it introduced mistake-driven learning for linear separators, but it only converges when the data are linearly separable.
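The classic mistake-driven training loop in NumPy on separable toy data; it halts once an epoch passes with no mistakes, which the convergence theorem guarantees for separable data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([2.0, -1.0]) + 0.5 > 0, 1, -1)

w, b = np.zeros(2), 0.0
for _ in range(20):                       # epochs
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i + b) <= 0:      # update only on a mistake
            w += y_i * x_i
            b += y_i
            mistakes += 1
    if mistakes == 0:                     # converged: data are separable
        break

print(w, b, mistakes)
```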
- **Transformer**: A Transformer is a sequence model built from self-attention, position-wise MLPs, residual connections, and normalization, rather than recurrence or convolution. Its key advantage is that every token can directly attend to every other token in parallel, which made modern LLM scaling practical.
- **Decoder Block**: A decoder block is the basic unit of a decoder-only Transformer: causal self-attention plus a position-wise MLP, wrapped with residual connections and normalization. Stacking these blocks lets the model mix context across tokens while preserving autoregressive generation.
- **Decoder-only Transformer**: A decoder-only Transformer is a Transformer built entirely from decoder blocks whose causal self-attention lets each token attend only to earlier tokens. This makes it the standard architecture for autoregressive language models such as GPT, LLaMA, and Claude.
- **Position-wise MLP**: A position-wise MLP is the feed-forward sublayer in a Transformer block, applied independently to each token after attention. It adds nonlinearity and channel mixing per token, complementing attention, which mixes information across positions.
- **Residual Connection (Skip Connection)**: A residual connection adds a layer's input back to its output, so the layer learns a correction rather than an entirely new representation. This stabilizes optimization, improves gradient flow, and is one reason very deep networks and Transformers train reliably.
- **Context Window**: The context window is the maximum number of tokens a model can process in one forward pass. It defines the model's accessible working memory at inference time, and longer windows increase both usefulness on long documents and computational cost.
- **Large Language Model (LLM)**: A large language model is a very large neural language model, usually with billions of parameters, pretrained on massive text corpora. Scale gives LLMs broad world knowledge and emergent capabilities such as in-context learning, but the core training objective is still language modeling.
- **Sparse Mixture-of-Experts (MoE) Layer**: A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset such as top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
- **Router Network**: A router network scores experts or computation paths for each token and decides where that token should be sent in a conditional-compute model such as an MoE. A good router improves specialization while avoiding collapsed routing, overload, and excessive communication.
- **Expert Network**: An expert network is one of the specialized submodules inside an MoE layer that processes only the tokens routed to it. Experts usually share the same architecture but learn different functions, so specialization emerges from routing plus load-balancing constraints.
- **Top-k Routing**: Top-k routing sends each token only to the k highest-scoring experts instead of to every expert. This makes MoE computation sparse and efficient, but the choice of k trades off compute cost, robustness, and routing stability.
- **Load Balancing (MoE)**: Load balancing in MoE training adds losses or routing constraints so tokens are spread across experts instead of collapsing onto a few popular ones. It matters because uneven routing wastes capacity, creates bottlenecks, and leaves underused experts poorly trained.
- **Switch Transformer**: Switch Transformer is a simplified MoE Transformer that routes each token to exactly one expert in each sparse feed-forward layer. Top-1 routing reduces communication and implementation complexity, enabling very large sparse models, but makes router stability and load balancing especially important.
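A PyTorch sketch illustrating the last few MoE entries: a router network, top-k dispatch, and a Switch-style auxiliary load-balancing loss. The explicit loop over experts is for clarity only; real systems use batched dispatch, and all sizes here are toy values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, d_model, n_experts, k = 16, 8, 4, 2

x = torch.randn(n_tokens, d_model)
router = torch.nn.Linear(d_model, n_experts)          # router network
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

gates = F.softmax(router(x), dim=-1)                  # expert scores per token
topv, topi = gates.topk(k, dim=-1)                    # top-k routing
topv = topv / topv.sum(dim=-1, keepdim=True)          # renormalize kept gates

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    for slot in range(k):
        mask = topi[:, slot] == e                     # tokens routed to expert e
        out[mask] += topv[mask, slot, None] * expert(x[mask])

# Switch-style auxiliary load-balancing loss: penalize correlation between
# the fraction of tokens dispatched to each expert (top-1 counts) and the
# mean gate probability mass each expert receives.
dispatch = F.one_hot(topi[:, 0], n_experts).float().mean(0)
importance = gates.mean(0)
aux_loss = n_experts * (dispatch * importance).sum()
print(out.shape, aux_loss.item())
```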
- **Structured Pruning**: Structured pruning removes whole channels, heads, layers, or blocks, producing regular sparsity that hardware can exploit directly. It usually yields better real-world speedups than unstructured pruning, though it gives less fine-grained control.
- **Vision Language Model (VLM)**: A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- **Cross-Attention**: Cross-attention lets one sequence or modality attend to representations produced by another sequence or modality. In encoder-decoder models the decoder queries encoder states, and in multimodal models text tokens often query visual features the same way.
- **Vision Encoder**: A vision encoder maps an image into features or tokens that downstream modules can use for classification, retrieval, or generation. CNNs and Vision Transformers are common vision encoders, differing mainly in how they represent spatial structure.
- **CLIP (Contrastive Language-Image Pre-training)**: CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels.
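A NumPy sketch of CLIP-style zero-shot classification with stand-in embeddings; in real use the vectors would come from the trained image and text encoders, and the logit scale of 100 mirrors the cap on CLIP's learned temperature.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for encoder outputs: one image embedding and one text embedding
# per candidate label prompt ("a photo of a cat", ...).
image_emb = normalize(rng.normal(size=(1, 512)))
text_embs = normalize(rng.normal(size=(3, 512)))     # 3 candidate labels
labels = ["cat", "dog", "car"]

logit_scale = 100.0                                  # learned temperature in CLIP
logits = logit_scale * image_emb @ text_embs.T       # scaled cosine similarities
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(labels[int(probs.argmax())], probs.round(3))
```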
- **Positional Encoding**: Positional encoding injects token order information into architectures like Transformers whose attention is otherwise permutation-invariant. It can be absolute or relative, and the choice strongly affects extrapolation, long-context behavior, and inductive bias.
- **Absolute Position Encoding**: Absolute position encoding assigns each sequence position its own encoding or embedding and combines it with token representations. It works well inside the trained context range, but it often extrapolates poorly because positions are treated as fixed IDs rather than relative distances.
- **Relative Position Encoding**: Relative position encoding represents how far apart tokens are rather than assigning each position a standalone ID. That lets attention depend on distance or offset, which often improves length generalization and transfers patterns more naturally across positions.
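The original sinusoidal absolute encoding from "Attention Is All You Need" in a few NumPy lines; by design, the encoding at position t+k is a fixed linear function of the encoding at t, which is how relative offsets stay expressible from absolute codes.

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Absolute sinusoidal positional encoding (d_model must be even here)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)     # even channels
    pe[:, 1::2] = np.cos(angles)     # odd channels
    return pe

pe = sinusoidal_positions(n_pos=128, d_model=64)
print(pe.shape)                      # (128, 64), added to token embeddings
```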
- **Attention Mechanism**: Attention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector.
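Scaled dot-product attention in NumPy with an optional causal mask, the core computation behind this and the Transformer entries above; shapes and values are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:  # block attention to future positions (lower-triangular mask)
        scores = np.where(np.tri(*scores.shape, dtype=bool), scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
print(attention(Q, K, V, causal=True).shape)   # (5, 16)
```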
- **Encoder-Decoder Architecture**: An encoder-decoder architecture uses an encoder to turn an input sequence into representations and a decoder to generate an output sequence conditioned on those representations. It is the standard design for translation, summarization, and other input-to-output generation tasks.
- **Sequence-to-Sequence (Seq2Seq)**: Sequence-to-sequence learning maps one sequence to another, often with different lengths, such as translation or summarization. Modern seq2seq models are usually encoder-decoder Transformers, though earlier versions used recurrent networks with attention.
- **Batch Normalization**: Batch normalization normalizes activations using mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference because it relies on running statistics.
- **Layer Normalization**: Layer normalization normalizes activations across features within each example rather than across the batch. It works well for variable-length sequences and small batch sizes, which is why it is standard in Transformers.
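The two normalizations differ only in which axis the statistics are computed over, which a few NumPy lines make concrete (the learned scale and shift are omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))   # (batch, features)
eps = 1e-5

# Batch norm: statistics per FEATURE, computed across the batch (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm: statistics per EXAMPLE, computed across features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))   # ~0 per feature
print(ln.mean(axis=1).round(6))   # ~0 per example
# Both are followed by learned scale (gamma) and shift (beta) parameters.
```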
- **Sparse Autoencoder (Mechanistic Interpretability)**: In mechanistic interpretability, a sparse autoencoder is trained on model activations to decompose dense, superposed representations into a larger set of sparse features. This often makes latent structure more interpretable, because individual learned directions can line up with human-readable concepts or behaviors.
- **Classification Head**: A classification head is the final task-specific layer that maps learned representations to class logits or probabilities. In transfer learning it is often the only part trained from scratch, while the backbone provides reusable features.
- **Softmax Head**: A softmax head is the output projection plus softmax normalization that converts hidden representations into a probability distribution over classes or vocabulary items. In language models it is the layer that turns the final hidden state into next-token probabilities.
- **Encoder (Transformer)**: A Transformer encoder is a stack of self-attention and feed-forward blocks that builds contextual representations of an input sequence. Because encoder self-attention is usually bidirectional, it is well suited for understanding tasks such as classification, retrieval, and sequence labeling.
- **Decoder (Transformer)**: A Transformer decoder is the autoregressive half of the architecture that predicts tokens using causal self-attention and, in encoder-decoder models, optional cross-attention to an encoder output. Its defining constraint is that each position can attend only to earlier positions when generating.
- **Kolmogorov-Arnold Networks**: Kolmogorov-Arnold Networks replace fixed scalar weights on edges with learnable one-dimensional functions, so layers are built from sums of learned univariate transforms rather than simple affine maps. They are motivated by the Kolmogorov-Arnold representation theorem and are often discussed as a more interpretable alternative to MLPs, not a universal replacement.
- **State Space Models / Mamba**: State space models such as Mamba process sequences by evolving a learned hidden state through recurrence rather than full quadratic attention. Their main appeal is linear-time sequence processing with strong long-context efficiency, especially when selective state updates let the model decide what to remember.
- **Auxiliary Load-Balancing Loss (MoE)**: The auxiliary load-balancing loss in a Mixture-of-Experts model encourages the router to spread tokens more evenly across experts. Without it, routing often collapses onto a few experts, which wastes capacity and creates severe hot spots in both learning and systems performance.
- **Vision Transformer (ViT)**: Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern vision-language and self-supervised vision models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work.
- **Masked Autoencoder (MAE)**: A self-supervised ViT pretraining objective: randomly mask 75% of image patches and train an asymmetric encoder–decoder to reconstruct pixel values from the visible 25%. MAE is simple, compute-efficient (the encoder sees only unmasked patches), and produces state-of-the-art ImageNet fine-tuning representations.
- **DINOv2**: A self-supervised ViT pretraining recipe from Meta (Oquab et al., 2023) that combines a DINO-style self-distillation objective with an iBOT masked-patch prediction objective and a curated 142M-image dataset. DINOv2 produces general-purpose frozen visual features that outperform task-specific supervised baselines on classification, segmentation, depth, and correspondence.
- **Diffusion Transformers (DiT)**: Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer over VAE-latent patches. DiT scales predictably with compute, matches or exceeds U-Net quality, and is the architectural backbone of Stable Diffusion 3, Sora, and most frontier text-to-image/video diffusion models.
- **Stable Diffusion Pipeline**: A text-to-image pipeline composed of (i) a VAE that compresses pixels to a 64×-smaller latent, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. The three pretrained components are composed at inference, with classifier-free guidance steering samples toward the text conditioning.
- **Whisper (Speech-to-Text)**: OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, and voice-activity / language ID from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- **Encodec / Neural Audio Codecs**: Meta's Encodec (2022) is a neural audio codec that compresses audio to discrete tokens via residual vector quantisation (RVQ) and reconstructs it with a neural decoder. Encodec is the tokeniser of choice for generative audio models (AudioLM, MusicGen, VALL-E), bridging continuous audio and LLM-style discrete modelling.
- **Residual Networks (ResNet as Architecture)**: A residual network replaces a plain layer stack with blocks that learn a residual update \(F(x)\) and add it back to the input, so each block computes \(y = x + F(x)\). This makes very deep CNNs much easier to optimize and triggered the shift from VGG-style stacks to residual architectures.
- **Transformer-XL / Segment-Level Recurrence**: Dai et al. (2019) extend Transformers beyond fixed context by caching hidden states of the previous segment and allowing attention to read from them: a simple "segment-level recurrence" that gives an effective receptive field of \( N \cdot L \) for \( L \) layers and segment length \( N \). Paired with relative positional encoding, it was a key bridge between pure attention and long-context models.
- **Longformer / BigBird (Sparse Long-Context Attention)**: Fixed sparsity patterns that reduce attention from \( O(n^2) \) to \( O(n) \) for long documents. Longformer combines sliding-window + global attention; BigBird adds random attention and proves the result retains full-attention universal-approximation properties. Both were pre-2022 answers to scaling Transformers to 4k–16k tokens.
- **RetNet / Retention Networks**: Sun et al. (2023) introduce a Transformer-alternative block whose retention operator admits three equivalent forms: parallel (for training), recurrent (for \( O(1) \) inference per token), and chunkwise-recurrent (for long-sequence training). RetNet aims for RNN-like inference cost with Transformer-like parallelisable training.
- **RWKV**: An RNN-Transformer hybrid (Peng et al., 2023) whose block is a parallelisable linear-attention operation at training time and a simple recurrent state update at inference time. RWKV scales to 14B+ parameters with Transformer-competitive perplexity, offering constant-memory inference.
- **Hyena / Long Convolutions**: Poli et al. (2023) propose replacing attention with a data-controlled long-range convolution: a filter parameterised implicitly by an MLP-of-positions, applied via FFT for \( O(n \log n) \) cost. Hyena approaches Transformer quality on pretraining perplexity at a fraction of the compute.
- **xFormers / Memory-Efficient Attention**: A library / pattern of attention implementations that avoid materialising the \( n \times n \) attention matrix, reducing memory from \( O(n^2) \) to \( O(n) \). xFormers bundles FlashAttention, Memory-Efficient Attention (Rabe & Staats), block-sparse variants, and ALiBi/RoPE patches under a unified API, a precursor to the default attention kernels shipped in PyTorch 2.0+.
- **Ring Attention / Context Parallel for Long Sequences**: A distributed-attention algorithm that shards an \( n \)-token sequence across \( P \) devices and computes each attention output via a ring of key-value rotations. Ring Attention (Liu et al., 2023) enables context lengths of millions of tokens on multi-GPU clusters with near-linear scaling.
- **Mixture of Depths (MoD)**: Mixture of Depths (Raposo et al., 2024) lets each token choose whether to go through the expensive self-attention + MLP stack at each layer, or to skip it via a residual. A small router predicts a saliency score; the top-\( k \) tokens per batch compute, the rest pass through. This per-token adaptive compute is the depth-axis counterpart of Mixture-of-Experts (width-axis) and substantially reduces FLOPs at matched quality.
- **Graph Convolutional Network (GCN)**: Kipf & Welling's GCN (2017) applies a first-order spectral convolution on graphs: each node's representation is updated as a normalised sum of its neighbours' features, \( H^{(\ell+1)} = \sigma(\tilde D^{-1/2} \tilde A \tilde D^{-1/2} H^{(\ell)} W^{(\ell)}) \). The symmetric normalisation comes from a spectral argument; practically, it is the simplest and most widely-taught GNN layer, setting the template for all message-passing architectures.
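The propagation rule above in NumPy on a 4-node toy graph; node features and weights are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 4-node toy graph given as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

def gcn_layer(H, A, W):
    """H' = relu(D^{-1/2} (A + I) D^{-1/2} H W): the Kipf & Welling rule."""
    A_tilde = A + np.eye(len(A))                 # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)

H = rng.normal(size=(4, 6))                      # node features
W = rng.normal(size=(6, 3))                      # layer weights
print(gcn_layer(H, A, W))
```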
- **Graph Attention Network (GAT)**: GAT (Veličković et al., 2018) replaces GCN's fixed degree-normalised aggregation with attention: each node learns per-edge weights via a shared attention mechanism. This gives inductive generalisation (no dependence on the full graph's degree matrix), handles heterogeneous neighbourhoods, and approaches Transformer-style flexibility, though at higher computational cost than GCN.
- **U-Net Architecture**: A fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- **Object Detection: R-CNN → Faster R-CNN → DETR**: R-CNN ran a CNN classifier on externally-proposed regions; Fast R-CNN shared backbone features across proposals; Faster R-CNN introduced a learned Region Proposal Network; DETR replaced the entire region-proposal pipeline with a transformer that predicts a fixed set of boxes via bipartite matching.
- **PointNet & 3D Deep Learning on Point Clouds**: PointNet processes an unordered point set by applying a shared MLP to each point, then pooling across points with a symmetric function (max-pool). Permutation-invariant by construction; PointNet++ adds local-region hierarchies to capture geometric structure.
- **Neural Ordinary Differential Equations**: A neural ODE defines the hidden-state evolution as \( dh/dt = f_\theta(h, t) \), integrated by a black-box ODE solver. Training uses the adjoint method to back-propagate at constant memory regardless of solver depth. Connects residual networks to continuous flows and underlies continuous normalising flows and flow matching.
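A deliberately simplified sketch of the dynamics in NumPy, integrated with fixed-step Euler; a real neural ODE would use an adaptive black-box solver and the adjoint method for constant-memory backpropagation, both of which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W1 = rng.normal(0, 0.5, (d + 1, 16))   # +1 input channel for the time t
W2 = rng.normal(0, 0.5, (16, d))

def f(h, t):
    """Learned dynamics f_theta(h, t): a tiny time-conditioned MLP."""
    return np.tanh(np.append(h, t) @ W1) @ W2

def odeint_euler(h0, t0, t1, steps=100):
    """Fixed-step Euler integration of dh/dt = f(h, t)."""
    h, t = h0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t)           # continuous analogue of a residual step
        t += dt
    return h

print(odeint_euler(rng.normal(size=d), 0.0, 1.0))
```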
- **Pointer Networks**: A seq2seq architecture whose decoder outputs indices into the input sequence via attention weights (a 'pointer') rather than tokens from a fixed vocabulary. Ideal for combinatorial problems whose output vocabulary depends on the input (convex hull, TSP, sorting) and for extractive QA / span prediction.
- **Siamese Networks & Metric Learning**: Train a shared encoder so that semantically similar inputs map to nearby embeddings (contrastive / triplet loss) or that query-key scores reflect similarity directly. Used in face verification, signature matching, image retrieval; the conceptual parent of SimCLR, CLIP, and BiEncoder retrieval.
- **Highway Networks**: Predecessor to ResNet: a gated skip connection \( y = H(x) \cdot T(x) + x \cdot (1 - T(x)) \), where \( T(x) \in [0,1] \) is a learned transform gate. Enabled training of 100+ layer networks before residual connections simplified the construction in ResNet.
- **Capsule Networks**: Hinton's alternative to CNN pooling: neurons are grouped into 'capsules' whose vector output encodes both existence and pose of an entity. Dynamic routing by agreement replaces max-pool, so each capsule decides which higher-level capsule to vote for based on agreement. Historically significant; practically superseded by transformers.
- **Memory-Augmented Transformers**: Transformer variants that extend effective context length with an external memory: Recurrent Memory Transformers (RMT) pass summary tokens across chunks, Memorizing Transformers retrieve past kNN keys, Infini-attention compresses the tail of context into a linear-attention state. A bridge between fixed-context Transformers and sequence models with unbounded memory.
- **Set Transformer & Deep Sets (Permutation Invariance)**: Deep Sets: any permutation-invariant function on sets equals \( \rho(\sum_i \phi(x_i)) \) for learnable \( \phi, \rho \). Set Transformer replaces the sum with self-attention via Induced Set Attention Blocks, giving element-wise interactions while remaining permutation-equivariant.
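The Deep Sets decomposition in NumPy, with a permutation check confirming invariance; phi and rho are single random layers purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(0, 0.5, (3, 8))     # per-element encoder phi
W_rho = rng.normal(0, 0.5, (8, 2))     # set-level readout rho

def deep_set(X):
    """rho(sum_i phi(x_i)): invariant because summation ignores element order."""
    phi = np.tanh(X @ W_phi)           # applied to every element independently
    pooled = phi.sum(axis=0)           # symmetric pooling over the set
    return np.tanh(pooled @ W_rho)

X = rng.normal(size=(5, 3))            # a set of 5 elements
perm = rng.permutation(5)
print(np.allclose(deep_set(X), deep_set(X[perm])))   # True: order is irrelevant
```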
- **Medusa & EAGLE Speculative-Decoding Heads**: Speculative decoding with learned draft heads instead of a separate draft model. Medusa adds \( K \) lightweight heads on the base model that predict future tokens; the base model verifies the resulting tree of hypotheses in one forward pass. EAGLE models the residual stream directly and achieves 3–4× speedup with a tiny draft network.
- **CTC Loss & RNN-Transducer**: Two objectives for training sequence-to-sequence models when alignment between input and output frames is unknown. CTC sums over all alignment paths with blank symbols; RNN-T splits the model into a prediction network and a joint network, letting output length vary independently of the input frame count. Backbones of modern ASR pipelines.
- **Conformer Architecture**: A hybrid CNN-plus-attention block for speech recognition: each Conformer layer combines multi-head self-attention (global), depthwise convolutions (local), and sandwiched feed-forward modules. Outperforms pure Transformer and pure CNN on LibriSpeech and became the de facto encoder for production ASR.
- **Unified Multimodal Models (GPT-4o / Gemini any-to-any)**: Single models that process and generate multiple modalities (text, image, audio, video) through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- **Algorithmic Alignment Theory**: A neural architecture generalises better on a reasoning task when its computational structure aligns with the algorithm that solves the task. Xu et al. (2020) formalise sample complexity in terms of the number of network modules that must be learned and the per-module learnability, predicting that GNNs (multi-step message passing) align with dynamic-programming algorithms while plain MLPs do not.
- **Invariance vs Equivariance**: A representation is invariant to a transformation if the output does not change when the input is transformed, and equivariant if the output changes in a predictable transformed way. CNN translation equivariance and classifier translation invariance are the canonical example pair.
- **Autoregressive vs Diffusion Tradeoffs**: Autoregressive models factorise \( p(x) = \prod_t p(x_t \mid x_{<t}) \) and dominate text generation; diffusion models learn a denoising process and dominate continuous-modality generation. The two paradigms differ in likelihood tractability, sampling cost, controllability, and compositionality, and the right choice depends on whether tokens are discrete, whether parallel decoding is required, and whether log-likelihood or perceptual quality is the figure of merit.
- **Neural Fields / Implicit Neural Representations**: A neural field represents a continuous signal with a neural network that maps coordinates to values such as color, density, or signed distance. This makes the model itself a compact continuous representation of an image, shape, or scene, with NeRF as the best-known example.
- **Vision Transformer (ViT) Variants**: Since the original ViT, a wide family of variants has emerged that improve data efficiency, locality, hierarchy, and pretraining objective. The most influential are DeiT (training recipe), Swin (windowed hierarchical attention), MAE (masked-image pretraining), DINOv2 (self-distilled features), and SigLIP (sigmoid contrastive pretraining). Each addresses a specific weakness of the vanilla ViT.
- **Graph Transformers**: Graph Transformers apply self-attention over graph-structured data, with positional encodings (Laplacian eigenvectors, random walks, shortest-path distances) injecting graph topology that vanilla attention lacks. They generalise message-passing GNNs and have become the leading architecture for molecular property prediction, code understanding, and combinatorial optimisation.
- **Perceiver Architecture**: The Perceiver uses cross-attention from a small latent array to a potentially very large input, then performs most computation in latent space. This decouples cost from input length and makes one architecture usable across images, audio, video, and other modalities.
- **Neural Architecture Search (modern approaches)**: NAS automates the design of network architectures by searching a parameterised space against a validation objective. Modern NAS abandons the slow RL-controller approach (NASNet) in favour of weight-sharing one-shot supernets (DARTS, ProxylessNAS), zero-cost proxies, and architecture-aware scaling laws (EfficientNet, NFNet). NAS has produced strong vision backbones but plays a smaller role in the LLM era.
- **Modular Neural Networks**: A modular network composes specialised sub-networks (modules) under a routing or composition rule that decides which modules process each input. Mixture-of-Experts is the most successful instance, but the family includes routed adapter networks, modular meta-learners, and compositional architectures designed for systematic generalisation. The motivation is parameter efficiency and reusable skills.
- **Neural Program Interpreters**: A neural program interpreter (NPI) is a network that executes program-like computations: looking up arguments, calling sub-routines, manipulating an external memory or stack, and conditioning on intermediate state. Early work (NPI, NTM, DNC, NeuralGPU) targeted symbolic algorithms; modern descendants are tool-using LLMs and chain-of-thought executors that lean on external interpreters and structured memory.
- **Compressed Sparse Attention (CSA)**: Compressed Sparse Attention (CSA) is a long-context attention scheme that first compresses the KV cache into block summaries and then performs sparse attention only over the top-k relevant compressed blocks. An added sliding-window branch preserves exact local dependencies, so CSA cuts both KV memory and long-context attention compute without collapsing into a purely local window.
- **Attention Is All You Need**: “Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.
- **AlexNet**: AlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe (ImageNet-scale data, GPU training, ReLU, dropout, and augmentation), not from a single isolated trick.
- **Bahdanau Attention**: Bahdanau attention is the original additive attention mechanism for sequence-to-sequence models, where the decoder scores each encoder state before producing the next token. It solved the fixed-context bottleneck of early seq2seq RNNs by letting the decoder look back over the whole source sequence at every step.
- **Seq2Seq with Attention**: Seq2seq with attention augments the encoder-decoder architecture so the decoder conditions on a context vector built from all encoder states at each output step. That change made neural machine translation far more effective than fixed-context seq2seq and directly paved the way to modern cross-attention and Transformer models.