Tag: landmark
95 topic(s)
- The Bitter Lesson: The Bitter Lesson is Sutton's argument that, over the long run, general methods that scale with compute and data outperform systems built around hand-crafted domain knowledge. It is a historical pattern claim, not a theorem, and its force comes from repeated examples in search, game playing, vision, and language.
- T5 (Text-to-Text Transfer Transformer): T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text generation, so translation, question answering, classification, and even some regression tasks share the same model and loss. Its span-corruption pretraining on C4 made it a landmark demonstration of unified transfer learning.
- GPT-2 & Zero-Shot Task Transfer: GPT-2 showed that a large decoder-only language model can perform many tasks in the zero-shot setting by continuing a task-formatted prompt rather than being fine-tuned. The key result was that scale and diverse web text made translation, summarization, and question answering look like ordinary next-token prediction.
- GPT-1 (Generative Pre-Training): GPT-1 established the pretrain-then-fine-tune recipe for Transformers: first train a decoder on unlabeled text with a language-model objective, then adapt it to downstream tasks with minimal task-specific layers. This showed that generic generative pretraining could beat many bespoke NLP architectures on downstream benchmarks.
- ELMo (Embeddings from Language Models): ELMo produces contextualized word embeddings by taking a learned task-specific combination of hidden states from a pretrained bidirectional LSTM language model. Unlike static embeddings such as word2vec or GloVe, it gives the same word different vectors in different sentence contexts.
- GloVe Word Embeddings: GloVe learns word embeddings by fitting vector dot products to the log of global word-word co-occurrence counts. Because it is trained on ratios of co-occurrence statistics, linear relations such as king minus man plus woman approximately equals queen often emerge in the embedding space.
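The fit just described is usually written as the weighted least-squares objective below, where \( X_{ij} \) counts co-occurrences of words \( i \) and \( j \) and \( f \) down-weights rare and very frequent pairs:

\[
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\]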
- Neural Turing Machine (NTM): A Neural Turing Machine augments a neural controller with a differentiable external memory that it can read from and write to using soft attention over memory locations. It was an early attempt to learn algorithm-like behavior such as copying and sorting while remaining trainable end to end.
- ImageNet Dataset: ImageNet is a large, hierarchically labeled image dataset whose 1000-class ILSVRC benchmark became the defining testbed for modern computer vision. AlexNet's 2012 win on ImageNet triggered the deep learning shift by showing that GPU-trained CNNs could dramatically beat hand-engineered pipelines.
- Neural Probabilistic Language Model: The Neural Probabilistic Language Model replaced count-based n-grams with learned word embeddings and a neural network that predicts the next word from a continuous representation of context. Its core contribution was showing that distributed representations let language models generalize to unseen but similar word sequences.
- Sinusoidal Positional Encoding: Sinusoidal positional encoding adds fixed sine and cosine patterns of different frequencies to token embeddings so the model can infer token order. The encoding is deterministic and smooth across positions, which let the original Transformer represent position without learning a separate table.
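A minimal NumPy sketch of that fixed pattern, assuming the standard formulation in which even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encodings, shape (num_positions, d_model)."""
    positions = np.arange(num_positions)[:, None]            # (pos, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # one frequency per dimension pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# Added to token embeddings before the first attention layer.
print(sinusoidal_encoding(4, 8).round(3))
```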
- PAC Learning (Probably Approximately Correct): PAC learning formalizes what it means for a hypothesis class to be learnable: with enough samples, an algorithm should return a hypothesis whose error is small with high probability. It is foundational because sample complexity and model capacity can then be expressed as rigorous guarantees instead of heuristics.
- Shannon Entropy: Shannon entropy measures the expected surprisal of a random variable and quantifies how uncertain its outcomes are. It is the basic information-theoretic quantity from which cross-entropy, KL divergence, mutual information, and many ML loss functions are built.
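For a discrete random variable \( X \) with distribution \( p \), the entropy and the cross-entropy built on top of it are:

\[
H(X) = -\sum_x p(x) \log p(x), \qquad H(p, q) = -\sum_x p(x) \log q(x)
\]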
- The Curse of Dimensionality: The curse of dimensionality is the collection of high-dimensional effects that make data sparse, neighborhoods less informative, and sample requirements explode as dimension grows. It helps explain why distance-based methods, density estimation, and exhaustive search often break down in large feature spaces.
- Markov Chain Monte Carlo (MCMC): Markov Chain Monte Carlo samples from a difficult target distribution by constructing a Markov chain whose stationary distribution matches that target. It is essential in Bayesian inference because it replaces intractable posterior integrals with averages over samples, provided the chain mixes well enough.
- Autoencoders: Autoencoders are neural networks trained to reconstruct their inputs after passing them through a compressed or otherwise constrained latent representation. They are useful because the bottleneck forces the model to learn structure in the data rather than just memorize an identity map.
- GAN Minimax Objective: The GAN minimax objective sets up a two-player game in which a generator tries to produce samples that fool a discriminator, while the discriminator tries to distinguish real from generated data. At equilibrium the generator matches the data distribution, though the training game is often unstable in practice.
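The two-player game in its usual value-function form, with generator \( G \), discriminator \( D \), and noise prior \( p_z \):

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log (1 - D(G(z)))]
\]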
- Q-Learning: Q-learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function by bootstrapping from a Bellman target over the best next action. Because its update does not require following the current policy, it became a foundational method in both tabular RL and DQN-style deep RL.
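A minimal tabular sketch of that bootstrapped update; the table shapes, learning rate, and discount below are illustrative assumptions rather than anything from the entry:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning step: bootstrap from the best next action."""
    td_target = r + gamma * np.max(Q[s_next])   # Bellman target over the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q

# Toy usage: 5 states, 2 actions, one observed transition (s=0, a=1, r=1.0, s'=3).
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```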
- The Bellman Equation: The Bellman equation recursively expresses the value of a state or state-action pair as immediate reward plus discounted expected future value. It is the backbone of dynamic programming and reinforcement learning because it turns long-horizon return into a local consistency condition.
- AdaBoost (Adaptive Boosting): AdaBoost builds an ensemble by repeatedly fitting weak learners to reweighted data so that previously misclassified examples receive more attention. Its final predictor is a weighted vote of the learners, and its power comes from turning many slightly better-than-random classifiers into a strong one.
- Backpropagation: Backpropagation computes gradients of a scalar loss with respect to all network parameters by applying the chain rule backward through the computation graph. It makes deep learning practical because it turns a complicated nested function into reusable local gradient calculations.
- Word embedding: A word embedding is a dense vector representation of a word learned from distributional context rather than hand-coded features. Its purpose is to place semantically or syntactically related words near one another in vector space so downstream models can generalize across vocabulary items.
- Word2Vec: Word2Vec is a family of shallow neural methods that learn word embeddings from local context, most famously via the skip-gram and CBOW objectives. Its importance is that simple predictive training on large text corpora produced useful semantic geometry, including analogy-like linear regularities.
- Language model: A language model assigns probabilities to token sequences, or equivalently predicts missing or next tokens from context. This unifies classical n-gram models, masked models like BERT, and autoregressive LLMs such as GPT under one probabilistic framework.
- Masked language model: A masked language model is trained to recover tokens hidden within a sequence using both left and right context. This bidirectional training makes MLMs strong encoders for understanding tasks, but less natural than autoregressive models for direct generation.
- Perceptron: The perceptron is a linear threshold classifier that predicts a class from the sign of \( w^\top x + b \) and updates its weights only on mistakes. It is historically important because it introduced gradient-like learning for linear separators, but it only converges when the data are linearly separable.
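A minimal sketch of the mistake-driven update rule, assuming labels in \( \{-1, +1\} \); the toy data are purely illustrative:

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Mistake-driven perceptron; y must contain labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # update only on mistakes
                b += yi
                mistakes += 1
        if mistakes == 0:                # a full clean pass: the data are separated
            break
    return w, b

# Toy linearly separable data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
```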
- Support Vector Machine (SVM): A support vector machine finds the decision boundary that maximizes the margin between classes, depending only on the support vectors nearest the boundary. With kernels, SVMs can model nonlinear separators while retaining a convex optimization objective.
- Transformer: A Transformer is a sequence model built from self-attention, position-wise MLPs, residual connections, and normalization, rather than recurrence or convolution. Its key advantage is that every token can directly attend to every other token in parallel, which made modern LLM scaling practical.
- Decoder-only Transformer: A decoder-only Transformer is a Transformer architecture composed only of masked self-attention blocks, so each token can attend only to earlier tokens. This makes it the standard architecture for autoregressive language models such as GPT, LLaMA, and Claude.
- Multi-Head Attention: Multi-head attention runs several attention mechanisms in parallel on different learned projections of the same input, then concatenates their outputs. This lets the model capture multiple relational patterns at once instead of forcing all interactions through a single attention map.
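A minimal NumPy sketch of the mechanism: project once, split the feature dimension into heads, attend per head, then concatenate and mix. The weight matrices here are random placeholders, and masking and dropout are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model). Returns (seq, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then reshape so each head attends over its own d_head-dimensional slice.
    Q = (X @ Wq).reshape(seq, num_heads, d_head).transpose(1, 0, 2)   # (heads, seq, d_head)
    K = (X @ Wk).reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)               # (heads, seq, seq)
    heads = softmax(scores) @ V                                       # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)           # re-merge the heads
    return concat @ Wo

# Toy usage with random weights.
rng = np.random.default_rng(0)
seq, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
```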
- Residual Connection (Skip Connection): A residual connection adds a layer's input back to its output, so the layer learns a correction rather than an entirely new representation. This stabilizes optimization, improves gradient flow, and is one reason very deep networks and Transformers train reliably.
- Few-Shot Prompting: Few-shot prompting includes a small number of labeled examples in the prompt so the model can infer the task from context without updating parameters. It is one of the clearest demonstrations of in-context learning in large language models.
- In-Context Learning: In-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Chain of Thought: Chain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into one token prediction.
- Reinforcement Learning from Human Feedback (RLHF): RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize reward while staying close to a reference model. It improved helpfulness and instruction following, but it can also create reward hacking and training instability.
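The policy-optimization stage is commonly written as a KL-regularized objective, where \( r_\phi \) is the learned reward model, \( \pi_{\text{ref}} \) the reference model, and \( \beta \) controls how far the policy may drift:

\[
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right)
\]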
- Constitutional AI: Constitutional AI aligns a model using an explicit list of principles that guide critique and revision, reducing the need for dense human feedback on every example. The constitution acts like a rule set for self-improvement, though the resulting behavior still depends on the chosen principles and training procedure.
- CLIP (Contrastive Language-Image Pre-training): CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels.
- KL Divergence: KL divergence measures how one probability distribution differs from a reference distribution through an expected log-ratio. It is nonnegative and asymmetric, so it is best understood not as a distance but as the penalty for modeling samples from one distribution with another.
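The expected log-ratio in symbols (discrete case shown; the continuous case replaces the sum with an integral):

\[
D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \;\ge\; 0
\]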
- Jensen's Inequality: Jensen’s inequality says that for a convex function f, applying f after taking an expectation gives a value no larger than taking the expectation after applying f. This one fact underlies many core results in ML, including the nonnegativity of KL divergence and the derivation of variational lower bounds.
- Bayes' Theorem: Bayes’ theorem updates beliefs by combining a prior with the likelihood of observed evidence to produce a posterior. In compact form, posterior is proportional to likelihood times prior, which is why Bayesian inference is fundamentally a rule for disciplined belief revision.
- Universal Approximation Theorem: The universal approximation theorem says that a sufficiently wide neural network with a suitable nonlinearity can approximate any continuous function on a compact domain arbitrarily well. It is an existence result, not a guarantee that training will find that approximation efficiently.
- Double Descent: Double descent is the phenomenon in which test error first follows the classical U-shape with increasing model size, then improves again once the model passes the interpolation threshold. It matters because it shows that the old bias-variance story is incomplete in highly overparameterized regimes.
- Neural Tangent Kernel (NTK): The Neural Tangent Kernel is the kernel that describes how an infinitely wide network trained by small gradient steps evolves around its initialization. In that limit, training becomes equivalent to kernel regression, which explains part of the behavior of very wide networks.
- Chinchilla Scaling Laws: Chinchilla scaling laws showed that many large language models were undertrained for their size under fixed compute budgets. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, yielding better compute-optimal performance.
- REINFORCE Algorithm: REINFORCE is the basic Monte Carlo policy-gradient algorithm that updates parameters by weighting the log-probability of sampled actions by their returns. It is unbiased, but its variance is high, which is why practical methods usually add baselines or critics.
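The Monte Carlo policy gradient in symbols, where \( G_t \) is the return from step \( t \) and \( b(s_t) \) is the optional baseline that practical methods add to reduce variance:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( G_t - b(s_t) \right) \right]
\]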
- Actor–Critic Methods: Actor-critic methods learn a policy and a value estimator at the same time. The actor chooses actions, while the critic estimates return and supplies a lower-variance learning signal than raw returns alone.
- Reasoning Models (o1 / R1-style Long-CoT): Reasoning models in the o1 or R1 style are language models trained or prompted to spend extra inference compute on long multi-step reasoning before answering. Their key idea is that better reasoning can come not only from bigger models, but from better search, verification, and credit assignment at inference and post-training time.
- Vision Transformer (ViT): Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern vision-language models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work.
- Masked Autoencoder (MAE): A self-supervised ViT pretraining objective: randomly mask 75% of image patches and train an asymmetric encoder–decoder to reconstruct pixel values from the visible 25%. MAE is simple, compute-efficient (the encoder sees only unmasked patches), and produces state-of-the-art ImageNet fine-tuning representations.
- Denoising Diffusion Probabilistic Models (DDPM): A generative model that learns to reverse a fixed Gaussian corruption process. Ho et al. (2020) showed that predicting the added noise with a neural network, trained by a simple MSE loss on \( T \) diffusion steps, yields state-of-the-art image synthesis — the foundation of all modern image/video diffusion.
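The simple MSE objective referred to above, in the usual notation where \( \bar\alpha_t \) is the cumulative product of the noise schedule and \( \boldsymbol{\epsilon} \sim \mathcal{N}(0, I) \):

\[
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, \mathbf{x}_0,\, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left( \sqrt{\bar\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\, \boldsymbol{\epsilon},\; t \right) \right\|^2 \right]
\]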
- Latent Diffusion: Run the diffusion process in the compressed latent space of a pretrained VAE rather than in pixel space. Latent diffusion (Rombach et al., 2022) slashes memory and compute by ~8× for images while preserving sample quality, and is the architecture behind Stable Diffusion, SDXL, SD3, and most text-to-image systems.
- Central Limit Theorem: Given i.i.d. samples \( X_1, \dots, X_n \) with finite mean \( \mu \) and variance \( \sigma^2 \), the standardised sample mean \( \sqrt{n}(\bar X_n - \mu)/\sigma \) converges in distribution to \( \mathcal{N}(0,1) \). The CLT underlies confidence intervals, stochastic-gradient noise analysis, and many initialisation arguments.
- Gaussian Mixture Models (GMM): A density model \( p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k) \) that represents data as a weighted sum of Gaussian components. Fitted by the EM algorithm, GMMs are the canonical example of latent-variable density estimation and the statistical cousin of \( k \)-means.
- Variational Inference / ELBO: A framework that turns Bayesian inference into optimisation: choose a tractable family \( q(\mathbf{z}; \phi) \) and maximise the evidence lower bound \( \mathcal{L}(\phi) = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \), which simultaneously approximates the posterior and bounds the log-evidence. VAEs, variational Bayes, and amortised inference all descend from this objective.
- Score Matching: An estimation principle (Hyvärinen, 2005) that fits an unnormalised density by matching the model's score \( \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) to the data's score. Integration-by-parts eliminates the unknown data-score, yielding a tractable objective that underlies modern score-based diffusion models.
- Denoising Score Matching: Vincent (2011) showed that the score of a Gaussian-corrupted data distribution \( q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \) admits a closed-form target, reducing score learning to a simple regression: predict \( (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 \). This identity is the algorithmic heart of modern diffusion models.
- DDIM (Deterministic Diffusion Sampling): Song, Meng & Ermon (2020) introduced a non-Markovian sampler for DDPM-trained diffusion models that generates samples in a fraction of the steps while matching quality. DDIM's \( \eta = 0 \) limit is a deterministic ODE integrator, enabling latent interpolation and invertibility.
- Stable Diffusion Pipeline: A text-to-image pipeline composed of (i) a VAE that compresses pixels to a 64×-smaller latent, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. All three pretrained components are glued by classifier-free guidance at inference.
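The classifier-free guidance step that combines the conditional and unconditional branches at inference is typically the extrapolation below, with text conditioning \( c \), null conditioning \( \varnothing \), and guidance scale \( s \):

\[
\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \varnothing) + s \left( \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, c) - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \varnothing) \right)
\]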
- Contrastive Learning (SimCLR / MoCo): Self-supervised visual representation learning via the InfoNCE loss: pull together two augmented views of the same image (positives) while pushing apart views of all other images (negatives). SimCLR uses in-batch negatives; MoCo uses a queued momentum encoder, enabling large effective negative pools with small batches.
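A minimal NumPy sketch of an in-batch InfoNCE loss in the spirit described above, simplified so each view-A embedding must pick out its paired view-B embedding among the batch; the temperature and batch size are illustrative:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (batch, dim) embeddings of two views of the same images.
    Positives sit on the diagonal of the similarity matrix."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)   # cosine similarity
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature                      # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                       # cross-entropy toward the diagonal

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
loss = info_nce(z_a, z_b)
```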
- BYOL / Self-Distillation: Bootstrap Your Own Latent (Grill et al., 2020) shows that strong visual representations can be learned without negatives: an online network predicts the output of a momentum-updated target network on a different augmentation of the same image. The same template underlies DINO, MoCo-v3, and other non-contrastive SSL methods.
- Vision-Language Contrastive Objectives Beyond CLIP: Successors to CLIP refine its symmetric InfoNCE loss for better efficiency, finer-grained alignment, and scaling. SigLIP replaces softmax with a pairwise sigmoid; LiT freezes a pretrained image tower; ALIGN scales the recipe to large, noisy alt-text data; FILIP does token-level contrast; CoCa adds a captioning head. All share the joint-embedding template.
- World Model: A learned predictive model of environment dynamics — given a state and action, predict the next state (and reward). World models enable planning, offline rollouts, and sample-efficient RL. Recent systems (Dreamer, MuZero, Genie) show that powerful latent world models scale to games, robotics, and interactive video generation.
- Whisper (Speech-to-Text): OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, and voice-activity / language ID from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- Residual Networks (ResNet as Architecture): A residual network replaces a plain layer stack with blocks that learn a residual update \(F(x)\) and add it back to the input, so each block computes \(y = x + F(x)\). This makes very deep CNNs much easier to optimize and triggered the shift from VGG-style stacks to residual architectures.
- Transformer-XL / Segment-Level Recurrence: Dai et al. (2019) extend Transformers beyond fixed context by caching hidden states of the previous segment and allowing attention to read from them — a simple "segment-level recurrence" that gives an effective receptive field of \( N \cdot L \) for \( L \) layers and segment length \( N \). Paired with relative positional encoding, it was a key bridge between pure attention and long-context models.
- RWKV: An RNN-Transformer hybrid (Peng et al., 2023) whose block is a parallelisable linear-attention operation at training time and a simple recurrent state update at inference time. RWKV scales to 14B+ parameters with Transformer-competitive perplexity, offering constant-memory inference.
- Iterative DPO: Iterative DPO repeats a simple loop: sample responses from the current policy, score or label them, apply a DPO-style update, and repeat. It brings online data collection to preference optimization while keeping DPO's simpler training dynamics compared with PPO-based RLHF.
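Each round applies the standard DPO loss over chosen/rejected pairs \( (y_w, y_l) \), with \( \pi_{\text{ref}} \) as the reference model and \( \beta \) the preference temperature:

\[
\mathcal{L}_{\text{DPO}} = -\, \mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]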
- Sleeper Agents: Hubinger et al. (2024) demonstrate that LLMs can be deliberately trained to have a backdoor — e.g., produce insecure code when a trigger string appears — and that standard safety training (RLHF, adversarial training, SFT on harmlessness) fails to remove the backdoor. Evidence that once deceptive alignment is present, it may persist through the standard post-training pipeline.
- Monosemanticity and SAE Features: Monosemanticity is the idea that a single learned feature corresponds to a single interpretable concept rather than many unrelated ones. Sparse autoencoders are used to decompose dense neural activations into a wider sparse basis where such features are easier to identify.
- Agent Benchmarks (SWE-bench, GAIA, WebArena): SWE-bench tests whether a model can fix real GitHub issues in code repositories, GAIA tests general tool-using problem solving with automatically checked answers, and WebArena tests web-navigation agents in simulated sites. Together they measure software, reasoning, and browser-action competence rather than just one-shot text generation.
- BERT: Bidirectional Encoder Pretraining: BERT (Devlin et al., 2019) is a Transformer encoder pretrained with masked language modelling (MLM) and next-sentence prediction (NSP), producing bidirectional contextual representations. Unlike GPT's causal, left-to-right pretraining, MLM sees both past and future tokens, making BERT the go-to encoder for classification, span extraction, and sentence embeddings. Base/Large variants (110M/340M params) dominated GLUE/SQuAD until decoder-only LLMs took over.
- Adapter Layers (Houlsby-style): Houlsby-style adapters are small bottleneck MLPs inserted inside each Transformer block while the original backbone is frozen. They were the first clean PEFT recipe: add a few trainable parameters per layer, keep the base model unchanged, and specialize the model to new tasks cheaply.
- Metropolis–Hastings Algorithm: The canonical MCMC recipe: propose \( \theta' \sim q(\theta' \mid \theta) \), accept with probability \( \min(1, \pi(\theta')q(\theta\mid\theta')/[\pi(\theta)q(\theta'\mid\theta)]) \). Produces a Markov chain with stationary distribution \( \pi \) for any valid proposal, turning intractable posterior sampling into a correctness-guaranteed iterative procedure.
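A minimal random-walk Metropolis sketch of that recipe; with a symmetric Gaussian proposal the \( q \)-ratio cancels, and the standard-normal target used here is only an illustrative assumption:

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk MH: symmetric Gaussian proposal, acceptance test done in log space."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    theta = theta0
    for i in range(n_steps):
        proposal = theta + step * rng.normal()
        # Accept with probability min(1, pi(proposal) / pi(theta)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta          # on rejection, the chain repeats the current state
    return samples

# Illustrative target: standard normal log-density, up to an additive constant.
draws = metropolis_hastings(lambda t: -0.5 * t**2, theta0=0.0)
```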
- Variational Autoencoder (VAE): A latent-variable generative model trained by maximising the ELBO \( \mathcal{L}(x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\text{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \). The reparameterisation trick makes the encoder \( q_\phi \) differentiable; the decoder \( p_\theta \) learns to reconstruct \( x \) from latent codes \( z \sim \mathcal{N}(0, I) \).
- Restricted Boltzmann Machines (RBM): A bipartite EBM over visible and hidden binary units with energy \( E(v, h) = -v^\top W h - b^\top v - c^\top h \). Conditional independence within each layer gives closed-form conditionals \( p(h\mid v), p(v\mid h) \); Hinton's Contrastive Divergence trains them and the RBM stack forms a deep belief net.
- GAN Family: WGAN, StyleGAN, BigGAN: Three architectural and objective milestones: WGAN uses the Kantorovich–Rubinstein dual of \( W_1 \) as a smoother critic, StyleGAN introduces AdaIN-controlled style injection for image generation, BigGAN scales class-conditional GANs to 512×512 with orthogonal regularisation and truncation tricks.
- U-Net Architecture: A fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- Object Detection: R-CNN → Faster R-CNN → DETR: R-CNN ran a CNN classifier on externally-proposed regions; Fast R-CNN shared backbone features across proposals; Faster R-CNN introduced a learned Region Proposal Network; DETR replaced the entire region-proposal pipeline with a transformer that predicts a fixed set of boxes via bipartite matching.
- Neural Radiance Fields (NeRF) & 3D Gaussian Splatting: NeRF encodes a 3-D scene as a continuous function \( (x, y, z, \theta, \phi) \to (\text{colour}, \text{density}) \) queried along camera rays and volume-rendered into pixels. 3D Gaussian Splatting replaces the implicit MLP with an explicit set of anisotropic Gaussians rasterised in real time.
- Pointer Networks: A seq2seq architecture whose decoder outputs indices into the input sequence via attention weights (a 'pointer') rather than tokens from a fixed vocabulary. Ideal for combinatorial problems whose output vocabulary depends on the input (convex hull, TSP, sorting) and for extractive QA / span prediction.
- Highway Networks: Predecessor to ResNet: a gated skip connection \( y = H(x) \cdot T(x) + x \cdot (1 - T(x)) \), where \( T(x) \in [0,1] \) is a learned transform gate. Enabled training of 100+ layer networks before residual connections simplified the construction in ResNet.
- Meta-Learning: MAML & Reptile: MAML learns an initialisation \( \theta \) such that one or a few SGD steps on a new task yield good performance — formally \( \min_\theta \sum_\tau \mathcal{L}_\tau(\theta - \alpha \nabla \mathcal{L}_\tau(\theta)) \). Reptile is a first-order simplification that moves \( \theta \) toward per-task adapted parameters. Influential early 'learn to learn' recipes, since absorbed into prompt-based few-shot learning.
- Conformal Prediction: A distribution-free procedure for turning any point predictor into a prediction set with guaranteed finite-sample coverage \( 1 - \alpha \) under exchangeability. Requires only a scoring function and a calibration set; no assumption on the underlying model or data distribution.
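A minimal sketch of split conformal prediction for regression with absolute-error scores; the calibration residuals and test predictions below are placeholders for whatever model and data are in use:

```python
import numpy as np

def conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal: calibration residuals -> intervals with ~(1 - alpha) coverage."""
    n = len(residuals_cal)
    # Finite-sample quantile level ceil((n + 1)(1 - alpha)) / n, clipped to 1.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals_cal, q_level)
    return y_pred_test - q, y_pred_test + q

# Placeholder calibration scores |y - f(x)| and test predictions.
rng = np.random.default_rng(0)
residuals_cal = np.abs(rng.normal(size=200))
lo, hi = conformal_interval(residuals_cal, y_pred_test=np.array([1.2, -0.3]))
```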
- RLVR: RL from Verifiable Rewards: Train a reasoning policy with pure RL signals from tasks whose answers are automatically verifiable — math (exact match), code (unit-test execution), proof (checker), chess (engine eval). No preference model needed. The method behind DeepSeek-R1-Zero and the o1-style long-CoT reasoning families.
- Reward Hacking & Specification Gaming in RLHF: When a learned reward model is a proxy for human preference, RL optimisation finds adversarial inputs that maximise reward without matching the true objective — verbose apologies, sycophancy, confident wrong answers, format exploitation. Goodhart's law in practice. Mitigations range from KL penalty and reward normalisation to process supervision and debate.
- Toolformer & Learned Tool Use: Toolformer trains a language model to decide when to call external tools and what arguments to send without requiring human demonstrations of tool use. It keeps tool calls only when the returned result improves the continuation, making tool use a self-supervised learning signal.
- CTC Loss & RNN-Transducer: Two objectives for training sequence-to-sequence models when the alignment between input and output frames is unknown. CTC sums over all alignment paths with blank symbols; the RNN-Transducer adds a prediction network over previously emitted labels and a joint network, removing CTC's conditional-independence assumption and letting output length be modelled independently of the input frame count. Backbones of modern ASR pipelines.
- Conformer Architecture: A hybrid CNN-plus-attention block for speech recognition: each Conformer layer combines multi-head self-attention (global), depthwise convolutions (local), and sandwiched feed-forward modules. Outperforms pure Transformer and pure CNN on LibriSpeech and became the de facto encoder for production ASR.
- Text-to-Image: DALL-E Lineage & Imagen: Autoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any): Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- Video Diffusion (Sora, Veo, Gen-3): Extend image-diffusion recipes to video with 3D patch embeddings, temporal attention, and long-context handling. Sora (OpenAI), Veo (Google), and Gen-3 (Runway) train DiT-style transformers over space-time patches of 1–60 second clips, conditioning on rich text captions for controllable generation.
- Attention Is All You Need: “Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.
- AlexNet: AlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe—ImageNet-scale data, GPU training, ReLU, dropout, and augmentation—not from a single isolated trick.
- Bahdanau Attention: Bahdanau attention is the original additive attention mechanism for sequence-to-sequence models, where the decoder scores each encoder state before producing the next token. It solved the fixed-context bottleneck of early seq2seq RNNs by letting the decoder look back over the whole source sequence at every step.
- Seq2Seq with Attention: Seq2seq with attention augments the encoder-decoder architecture so the decoder conditions on a context vector built from all encoder states at each output step. That change made neural machine translation far more effective than fixed-context seq2seq and directly paved the way to modern cross-attention and Transformer models.
- Backpropagation — History (Werbos → Rumelhart/Hinton/Williams): The history of backpropagation is the story of an idea known in pieces before it became a practical neural-network training method. Werbos articulated reverse-mode differentiation for network training in the 1970s, and Rumelhart, Hinton, and Williams turned it into the landmark 1986 demonstration that made multilayer neural networks trainable in practice.