History of Machine Learning & LLMs

Disclaimer: I used various LLMs to generate some of the data for this timeline.

1957
Perceptron at Cornell Aeronautical Laboratory

Frank Rosenblatt

Paper: The Perceptron: A Perceiving and Recognizing Automaton

The Perceptron was the first model that could learn the weights defining categories given examples from each category. It established the foundation for artificial neural networks by introducing a learning algorithm that could automatically adjust connection weights. The perceptron demonstrated that machines could learn from experience, marking a fundamental breakthrough in machine learning.
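
A minimal sketch of the perceptron learning rule on the AND function (illustrative; Rosenblatt's original formulation was a hardware design, and the variable names here are mine):

```python
import numpy as np

# Toy perceptron learning AND: threshold unit plus error-driven updates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)   # hard-threshold activation
        error = target - pred        # 0 if correct, otherwise +/-1
        w += lr * error * xi         # nudge weights toward the target
        b += lr * error

print(w, b)  # a separating hyperplane for AND
```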

1980
Neocognitron at NHK Science & Technical Research Laboratories

Kunihiko Fukushima

Paper: Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position

The Neocognitron was a hierarchical, multilayered neural network inspired by the visual cortex. It introduced the concepts of S-cells (simple cells) and C-cells (complex cells) arranged in a hierarchy, allowing for position-invariant pattern recognition. This architecture laid the groundwork for modern convolutional neural networks and demonstrated that local feature extraction combined with spatial pooling could achieve robust visual recognition.

1986
Backpropagation at University of California San Diego, Carnegie Mellon University, University of Toronto

David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams

Paper: Learning Representations by Back-propagating Errors

Backpropagation provided an efficient method for training multi-layer neural networks by computing gradients through the chain rule. This algorithm enabled the training of deep networks by propagating error signals backwards through layers, allowing hidden units to learn internal representations. Backpropagation became the workhorse of neural network training and remains fundamental to modern deep learning.
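
A tiny worked example of the chain rule in action, sketched in NumPy for a two-layer network on a single example (illustrative; shapes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # input
t = np.array([1.0])                 # target
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(1, 4)) * 0.1

# Forward pass
h_pre = W1 @ x
h = np.tanh(h_pre)
y = W2 @ h
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: propagate the error signal layer by layer
dy = y - t                          # dL/dy
dW2 = np.outer(dy, h)               # dL/dW2
dh = W2.T @ dy                      # error reaching the hidden layer
dh_pre = dh * (1 - h ** 2)          # through tanh' = 1 - tanh^2
dW1 = np.outer(dh_pre, x)           # dL/dW1

lr = 0.1
W1 -= lr * dW1                      # gradient descent step
W2 -= lr * dW2
```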

1997
Long Short-Term Memory (LSTM) at Technische Universität München

Sepp Hochreiter, Jürgen Schmidhuber

Paper: Long Short-Term Memory

LSTM addressed the vanishing gradient problem in recurrent neural networks by introducing memory cells with gating mechanisms. The architecture uses input, output, and forget gates to control information flow, enabling networks to learn long-term dependencies. LSTM became the dominant architecture for sequence modeling tasks including speech recognition, machine translation, and time series prediction before the transformer era.
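
A single LSTM cell step, sketched in NumPy (biases omitted and gate weights folded into one matrix each acting on [h_prev, x]; a simplification of the paper's formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_c):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)             # forget gate: what to erase
    i = sigmoid(W_i @ z)             # input gate: what to write
    o = sigmoid(W_o @ z)             # output gate: what to expose
    c_tilde = np.tanh(W_c @ z)       # candidate cell content
    c = f * c_prev + i * c_tilde     # additive update keeps gradients alive
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, X = 4, 3
W = [rng.normal(size=(H, H + X)) * 0.1 for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, *W)
```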

1998
Convolutional Neural Networks (LeNet) at AT&T Bell Laboratories

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner

Paper: Gradient-Based Learning Applied to Document Recognition

LeNet introduced a practical convolutional neural network architecture for document recognition. It combined convolutional layers for local feature extraction, pooling for spatial invariance, and fully connected layers for classification. This architecture demonstrated that CNNs could be trained end-to-end using backpropagation and achieved state-of-the-art results on handwritten digit recognition, establishing the blueprint for modern computer vision systems.

2003
The Neural Probabilistic Language Model at Université de Montréal

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin

Paper: A Neural Probabilistic Language Model

The Neural Probabilistic Language Model addressed the curse of dimensionality in language modeling by learning distributed representations for words. It introduced the idea that similar words would have similar vector representations, allowing the model to generalize to unseen word sequences. This foundational work pioneered the use of neural networks for language modeling and word embeddings, directly inspiring Word2Vec and modern language models.

2009
ImageNet Dataset at Princeton University

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei

Paper: ImageNet: A Large-Scale Hierarchical Image Database

ImageNet created a large-scale dataset with over 14 million labeled images across thousands of categories, organized hierarchically using WordNet. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the premier benchmark for computer vision. ImageNet's scale and diversity enabled the training of deep neural networks and catalyzed the deep learning revolution, particularly with AlexNet's breakthrough in 2012.

2010
Xavier/Glorot Initialization at Université de Montréal

Xavier Glorot, Yoshua Bengio

Paper: Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier initialization provided a principled method for initializing neural network weights to maintain consistent variance of activations and gradients across layers. By scaling initial weights based on the number of input and output connections, it prevented vanishing or exploding gradients during training. This simple but crucial technique enabled the training of much deeper networks and remains a standard practice in deep learning.
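
The rule itself fits in a few lines; a sketch of the uniform variant (the variance check at the end is illustrative):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Var(U(-a, a)) = a^2 / 3, so this limit gives Var(W) = 2/(fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))   # empirical variance matches the target
```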

2010
Rectified Linear Unit (ReLU) Activation at University of Toronto

Vinod Nair, Geoffrey E. Hinton

Paper: Rectified Linear Units Improve Restricted Boltzmann Machines

ReLU introduced a simple non-saturating activation function f(x) = max(0, x) that addressed the vanishing gradient problem of sigmoid and tanh activations. ReLU enabled faster training, reduced computational cost, and induced sparsity in neural networks. Despite its simplicity, ReLU became the default activation function for deep neural networks and enabled the training of much deeper architectures.
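
The function and its gradient in full (the gradient is 1 for positive inputs and 0 elsewhere, which is why it does not saturate):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(x.dtype)   # no shrinking factor for active units

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), relu_grad(x))
```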

2012
AlexNet at University of Toronto

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Paper: ImageNet Classification with Deep Convolutional Neural Networks

GitHub repo: cuda-convnet2

AlexNet won the ImageNet 2012 competition by a wide margin, demonstrating that deep convolutional networks trained on GPUs could dramatically outperform traditional computer vision methods. The architecture combined ReLU activations, dropout regularization, data augmentation, and GPU training. AlexNet's success marked the beginning of the deep learning era and sparked intense interest in neural networks across academia and industry.

2012
Dropout at University of Toronto

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

Paper: Improving Neural Networks by Preventing Co-adaptation of Feature Detectors

Dropout introduced a powerful regularization technique by randomly dropping units during training, preventing co-adaptation of features. This simple method significantly reduced overfitting in deep neural networks by training an ensemble of exponentially many sub-networks. Dropout became a standard regularization technique and enabled the training of larger networks without excessive overfitting.
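
A sketch of inverted dropout, the now-standard variant (the paper describes scaling at test time instead; rescaling at train time is equivalent in expectation):

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    if not training or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop   # keep each unit with prob 1 - p
    return h * mask / (1.0 - p_drop)       # rescale to preserve expectation

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout(h, 0.5, rng))   # roughly half zeroed, survivors doubled
```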

2013
Word2Vec at Google

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Paper: Efficient Estimation of Word Representations in Vector Space

GitHub repo: word2vec

Word2Vec introduced efficient methods (Skip-gram and CBOW) for learning dense vector representations of words from large corpora. These embeddings captured semantic and syntactic relationships, enabling vector arithmetic like 'king' - 'man' + 'woman' ≈ 'queen'. Word2Vec revolutionized natural language processing by providing a scalable way to represent words as continuous vectors, becoming foundational for modern NLP.
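
The data preparation behind Skip-gram is simple: every word predicts its neighbors within a context window. A toy sketch (the full method then trains embeddings on such pairs, e.g. with negative sampling):

```python
# Generate (center, context) training pairs for Skip-gram.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if i != j:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```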

2013
Variational Autoencoder (VAE) at University of Amsterdam

Diederik P. Kingma, Max Welling

Paper: Auto-Encoding Variational Bayes

VAE introduced a probabilistic approach to learning latent representations by combining variational inference with neural networks. It learns a distribution over latent codes rather than deterministic encodings, enabling both efficient inference and generation. VAE provided a principled framework for generative modeling and became influential in unsupervised learning, representation learning, and generative AI.
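
The key trick is reparameterization: sampling z = μ + σ·ε keeps the path from the loss back to the encoder differentiable. A sketch of that step plus the closed-form KL term for a diagonal Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])            # encoder outputs for one example
log_var = np.array([0.0, -0.5])

eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps  # differentiable w.r.t. mu and log_var

# KL( q(z|x) || N(0, I) ) in closed form for a diagonal Gaussian
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z, kl)
```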

2014
Generative Adversarial Network (GAN) at Université de Montréal

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Paper: Generative Adversarial Networks

GAN introduced a game-theoretic framework where a generator network learns to create realistic data by competing against a discriminator network. This adversarial training process enabled the generation of highly realistic images without requiring explicit modeling of probability distributions. GANs revolutionized generative modeling and spawned numerous applications in image synthesis, style transfer, and data augmentation.

2014
Adam Optimizer at University of Amsterdam, University of Toronto

Diederik P. Kingma, Jimmy Ba

Paper: Adam: A Method for Stochastic Optimization

Adam combined the benefits of AdaGrad and RMSProp by computing adaptive learning rates for each parameter using estimates of first and second moments of gradients. It included bias correction terms and proved robust across a wide range of problems with minimal hyperparameter tuning. Adam became the most widely used optimizer in deep learning due to its efficiency, ease of use, and strong empirical performance.
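
The full update rule, sketched as one step (hyperparameter defaults follow the paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2            # second-moment estimate
    m_hat = m / (1 - b1**t)                 # bias correction for early steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                               # gradient of ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```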

2014
Sequence-to-Sequence Learning at Google

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

Paper: Sequence to Sequence Learning with Neural Networks

GitHub repo: seq2seq

Seq2Seq introduced an end-to-end framework for sequence transduction using an encoder-decoder architecture with LSTMs. The encoder maps variable-length input sequences to fixed-size representations, which the decoder transforms into variable-length output sequences. This architecture unified many NLP tasks under a single framework and achieved breakthrough results in machine translation, establishing neural approaches as state-of-the-art.

2014
Attention Mechanism at Université de Montréal

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau attention addressed the bottleneck in sequence-to-sequence models by allowing the decoder to focus on different parts of the input sequence at each decoding step. This attention mechanism computed context vectors as weighted sums of encoder hidden states, where weights were learned based on relevance. Attention became a fundamental building block of modern NLP systems and directly inspired the transformer architecture.
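
One decoder step of additive attention, sketched in NumPy (score e_j = vᵀ tanh(W s + U h_j), following the paper's form; dimensions here are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 5, 8                            # source length, hidden size
enc = rng.normal(size=(T, d))          # encoder hidden states h_1..h_T
s = rng.normal(size=d)                 # current decoder state

W_a = rng.normal(size=(d, d)) * 0.1
U_a = rng.normal(size=(d, d)) * 0.1
v_a = rng.normal(size=d) * 0.1

scores = np.tanh(s @ W_a.T + enc @ U_a.T) @ v_a   # one score per source word
alpha = softmax(scores)                           # attention weights
context = alpha @ enc                             # weighted sum of states
```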

2014
GloVe Word Embeddings at Stanford University

Jeffrey Pennington, Richard Socher, Christopher D. Manning

Paper: GloVe: Global Vectors for Word Representation

GitHub repo: GloVe

GloVe combined global matrix factorization with local context window methods for learning word embeddings. It trained on aggregated word-word co-occurrence statistics to produce vectors with meaningful linear substructures. GloVe provided an alternative to Word2Vec with strong performance on word analogy and similarity tasks, and its pre-trained vectors became widely used in NLP applications.

2014
Neural Turing Machine at Google DeepMind

Alex Graves, Greg Wayne, Ivo Danihelka

Paper: Neural Turing Machines

Neural Turing Machines extended neural networks by coupling them to external memory resources accessed through attention mechanisms. The entire system was differentiable end-to-end, allowing gradient-based training. NTMs demonstrated that neural networks could learn simple algorithms like copying, sorting, and associative recall from examples alone, showing that neural networks could exhibit more algorithmic and programmable behavior.

2015
Batch Normalization at Google

Sergey Ioffe, Christian Szegedy

Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch Normalization normalized layer inputs across mini-batches, stabilizing training by reducing internal covariate shift. It enabled much higher learning rates, reduced sensitivity to initialization, and acted as a regularizer. Batch normalization dramatically accelerated training and became a standard component in deep networks, enabling the training of very deep architectures that were previously difficult to optimize.

2015
Residual Networks (ResNet) at Microsoft Research

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Paper: Deep Residual Learning for Image Recognition

GitHub repo: deep-residual-networks

ResNet introduced skip connections that allowed gradients to flow directly through networks by learning residual mappings. This simple architectural change enabled the training of networks with hundreds or even thousands of layers without degradation problems. ResNet won ImageNet 2015 and demonstrated that very deep networks could be effectively trained, fundamentally changing how we design neural network architectures.
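
The core idea fits in two lines: the block computes y = x + F(x), so the identity path carries the signal (and the gradient) even when F contributes little. A fully connected sketch of the paper's convolutional block:

```python
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(0.0, W1 @ x)     # F(x): two layers with a ReLU
    return x + W2 @ h               # skip connection adds the input back

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)
y = residual_block(x, rng.normal(size=(d, d)) * 0.1,
                      rng.normal(size=(d, d)) * 0.1)
```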

2015
Luong Attention (Global and Local Attention) at Stanford University

Minh-Thang Luong, Hieu Pham, Christopher D. Manning

Paper: Effective Approaches to Attention-based Neural Machine Translation

Luong attention introduced two complementary attention mechanisms for neural machine translation: global attention, which attends to all source words, and local attention, which focuses on a subset of source positions. The paper also proposed multiplicative (dot-product) attention as a simpler alternative to additive attention. These mechanisms achieved significant improvements over non-attentional systems, with the local attention approach gaining 5.0 BLEU points. The work established a new state-of-the-art on WMT'15 English-German translation and influenced subsequent attention designs in transformers.

2016
Layer Normalization at University of Toronto

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

Paper: Layer Normalization

Layer Normalization normalized inputs across features for each example independently, unlike batch normalization which normalized across the batch dimension. This made it particularly effective for recurrent neural networks and sequences of varying length. Layer normalization stabilized hidden state dynamics in RNNs and later became the standard normalization technique in transformer architectures.
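
The entire difference between the two techniques is the normalization axis, as the sketch below shows (learned scale and shift parameters omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(32, 64))   # (batch, features)

def batch_norm(x, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)          # stats per feature,
    return (x - mu) / np.sqrt(var + eps)             # across the batch

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=1, keepdims=True)               # stats per example,
    var = x.var(axis=1, keepdims=True)               # across the features
    return (x - mu) / np.sqrt(var + eps)
```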

2016
Subword Units (BPE): Solving the Rare Word Problem at University of Edinburgh

Rico Sennrich, Barry Haddow, Alexandra Birch

Paper: Neural Machine Translation of Rare Words with Subword Units

GitHub repo: subword-nmt

Byte-Pair Encoding (BPE) adapted a data compression algorithm for neural machine translation, enabling open-vocabulary learning by breaking words into subword units. This solved the rare word problem by representing infrequent words as sequences of common subwords. BPE became the standard tokenization approach for language models, enabling models to handle any word while maintaining reasonable vocabulary sizes, and is used in GPT, BERT, and most modern LLMs.
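
One merge step of the core loop, sketched on a toy corpus in the style of the paper's low/lower/newest example (the real algorithm repeats this until a target vocabulary size is reached):

```python
from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6}

def merge_step(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)                # most frequent pair
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])   # fuse into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

corpus, best = merge_step(corpus)
print(best)   # ('w', 'e') is the most frequent pair in this toy corpus
```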

2016
Key-Value Memory Networks at Facebook AI Research, Carnegie Mellon University

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, Jason Weston

Paper: Key-Value Memory Networks for Directly Reading Documents

Key-Value Memory Networks introduced a memory architecture that separates keys (used for addressing) from values (used for reading), enabling more effective question answering by directly reading documents. This separation allowed the model to use different encodings for matching queries to memory slots versus returning information, significantly improving performance on knowledge base and document-based QA tasks. The architecture influenced subsequent memory-augmented networks and retrieval-augmented generation systems.

2017
Transformer Architecture at Google

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Paper: Attention Is All You Need

GitHub repo: tensor2tensor

The Transformer replaced recurrence and convolutions entirely with self-attention mechanisms, processing sequences in parallel rather than sequentially. It introduced multi-head self-attention, positional encodings, and position-wise feedforward layers within an encoder-decoder structure. The Transformer achieved state-of-the-art translation results while being more parallelizable and requiring significantly less training time. This architecture became the foundation for modern large language models and revolutionized NLP.
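
Scaled dot-product attention, the mechanism at the heart of the architecture, sketched for a single head (multi-head attention runs several of these in parallel and concatenates the results):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarity, scaled
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ V                        # mix values by attention weight

rng = np.random.default_rng(0)
T, d = 6, 8                             # sequence length, head dimension
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)   # every token attends to all tokens
```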

2017
Reinforcement Learning from Human Feedback (RLHF) at OpenAI, UC Berkeley, DeepMind

Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Paper: Deep Reinforcement Learning from Human Preferences

RLHF introduced a method for training RL agents using human preference comparisons rather than hand-crafted reward functions. Humans compared pairs of trajectory segments, and a reward model was trained to predict preferences. This reward model then guided policy optimization. RLHF scaled preference-based learning to complex tasks and later became crucial for aligning large language models with human values and intentions.

2017
Sparsely-Gated Mixture of Experts at Google

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Mixture of Experts introduced conditional computation where a gating network routes each input to a sparse subset of expert sub-networks. This enabled training models with orders of magnitude more parameters without proportional increases in computation. MoE demonstrated that model capacity could be dramatically increased through sparsity, achieving state-of-the-art results in language modeling and translation. This approach later influenced large-scale models like GPT-4.
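
A toy sketch of top-k gating for a single token (the paper's noisy top-k gate adds trainable noise and auxiliary load-balancing losses, omitted here):

```python
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    logits = x @ W_gate                     # gating score per expert
    top = np.argsort(logits)[-k:]           # indices of the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                    # softmax over the selected k
    # Only the chosen experts run, so compute stays sparse.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, E = 8, 4
W_gate = rng.normal(size=(d, E))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)) * 0.1)
           for _ in range(E)]
y = moe_layer(rng.normal(size=d), W_gate, experts)
```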

2017
Proximal Policy Optimization (PPO) at OpenAI

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Paper: Proximal Policy Optimization Algorithms

GitHub repo: baselines

PPO introduced a simpler and more stable policy gradient method by clipping the objective function to prevent excessively large policy updates. It combined the benefits of trust region methods with the simplicity of first-order optimization. PPO became the most widely used reinforcement learning algorithm due to its robustness, ease of implementation, and strong empirical performance across diverse tasks.
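
The clipped surrogate objective itself, sketched for a batch of actions (advantage estimation and the value/entropy terms of the full algorithm are omitted):

```python
import numpy as np

def ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # Pessimistic minimum: large policy jumps cannot increase the objective.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

logp_old = np.array([-1.0, -0.5, -2.0])
logp_new = np.array([-0.8, -0.7, -1.0])
adv = np.array([1.0, -0.5, 2.0])
print(ppo_loss(logp_new, logp_old, adv))
```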

2018
ELMo (Embeddings from Language Models) at Allen Institute for AI, University of Washington

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

Paper: Deep Contextualized Word Representations

GitHub repo: bilm-tf

ELMo generated context-dependent word representations by using bidirectional LSTMs trained as language models. Unlike static embeddings, ELMo representations varied based on context, capturing polysemy and complex linguistic features. ELMo demonstrated the power of pre-training and fine-tuning, significantly improving performance across diverse NLP tasks. It was a crucial step toward modern contextualized language models and transfer learning in NLP.

2018
GPT (Generative Pre-Training) at OpenAI

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

Paper: Improving Language Understanding by Generative Pre-Training

GitHub repo: finetune-transformer-lm

GPT introduced a two-stage approach: unsupervised pre-training of a transformer language model on large text corpora, followed by supervised fine-tuning on specific tasks. This demonstrated that language models could learn general representations useful across many tasks. GPT showed that pre-training could significantly reduce the labeled data required for downstream tasks, establishing the pre-train-then-fine-tune paradigm that dominated subsequent NLP research.

2018
BERT (Bidirectional Encoder Representations from Transformers) at Google

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

GitHub repo: bert

BERT pre-trained bidirectional transformers using masked language modeling and next sentence prediction. Unlike previous unidirectional models, BERT jointly conditioned on both left and right context in all layers. BERT achieved state-of-the-art results across eleven NLP tasks and demonstrated that deeply bidirectional pre-training was crucial for language understanding. BERT became the foundation for numerous downstream applications and variants.

2018
Mixed Precision Training at NVIDIA, Baidu Research

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu

Paper: Mixed Precision Training

GitHub repo: apex

Micikevicius et al. showed how to safely train deep networks using half-precision (FP16) arithmetic while preserving full-precision accuracy. By keeping FP32 master weights, accumulating gradients in FP32, and using loss scaling to avoid underflow, they demonstrated 2–3× speedups on NVIDIA Tensor Cores without sacrificing convergence. Mixed precision became the standard recipe for large-scale transformer training, enabling today's models to fit within GPU memory budgets.
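
A NumPy sketch of the loss-scaling idea using dtypes alone (the real recipe runs on Tensor Cores inside a training framework; the values and names here are illustrative):

```python
import numpy as np

scale = np.float16(1024.0)
w_master = np.array([0.1, -0.2], dtype=np.float32)    # FP32 master weights

x = np.array([0.5, 1.5], dtype=np.float16)
t = np.float16(1.0)
w16 = w_master.astype(np.float16)                     # FP16 compute copy

err = w16 @ x - t                                     # forward in FP16
grad16 = (scale * 2 * err) * x                        # scaled so small grads
                                                      # don't flush to zero
grad32 = grad16.astype(np.float32) / float(scale)     # unscale in FP32

w_master -= 0.01 * grad32                             # FP32 weight update
```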

2019
GPT-2 at OpenAI

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

Paper: Language Models are Unsupervised Multitask Learners

GitHub repo: gpt-2

GPT-2 scaled up the original GPT to 1.5 billion parameters and trained on a larger, more diverse dataset. It demonstrated that language models could perform many tasks zero-shot without fine-tuning by simply conditioning on appropriate prompts. GPT-2 showed strong performance on diverse tasks including translation, summarization, and question answering, suggesting that with sufficient scale and data, language models naturally learn multitask capabilities.

2019
T5 (Text-to-Text Transfer Transformer) at Google

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

GitHub repo: text-to-text-transfer-transformer

T5 unified all NLP tasks into a text-to-text format where both inputs and outputs are text strings. It systematically explored transfer learning techniques including pre-training objectives, architectures, datasets, and fine-tuning methods. T5's encoder-decoder architecture and comprehensive evaluation provided insights into what makes transfer learning effective. The unified framework simplified multi-task learning and became influential for instruction-following models.

2019
The Bitter Lesson at University of Alberta, DeepMind

Richard Sutton

Paper: The Bitter Lesson (Essay)

The Bitter Lesson essay argued that general methods leveraging computation consistently outperform approaches that rely on human knowledge in the long run. Sutton observed that search and learning, when given sufficient computation, surpass hand-crafted features and domain expertise. This philosophical perspective influenced the field to focus on scalable learning methods rather than encoding human knowledge, providing intellectual foundation for the scaling paradigm in modern AI.

2020
Scaling Laws for Neural Language Models at OpenAI

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Paper: Scaling Laws for Neural Language Models

This work empirically demonstrated that language model performance scales as power-laws with model size, dataset size, and compute budget. The research showed predictable relationships between these factors and suggested optimal allocation strategies. These scaling laws provided quantitative guidance for training large models and predicted that simply scaling up models would continue to yield improvements, influencing subsequent investment in large-scale model development.

2020
GPT-3 at OpenAI

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Paper: Language Models are Few-Shot Learners

GPT-3 scaled transformers to 175 billion parameters, demonstrating that language models could perform diverse tasks with few-shot, one-shot, or zero-shot learning from prompts alone. It showed impressive performance on translation, question-answering, arithmetic, and novel word usage without gradient updates. GPT-3 revealed that with sufficient scale, language models develop broad capabilities and sparked widespread interest in large language models and prompt engineering.

2020
ZeRO (Zero Redundancy Optimizer) at Microsoft

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

GitHub repo: DeepSpeed

ZeRO eliminated memory redundancies in data-parallel distributed training by partitioning optimizer states, gradients, and parameters across devices rather than replicating them. ZeRO enabled training models with trillions of parameters by dramatically reducing per-device memory requirements while maintaining computational efficiency. This optimization became crucial for training large language models and is implemented in DeepSpeed, enabling the scale of models like GPT-3 and beyond.

2021
RoFormer: Rotary Position Embedding (RoPE) at Zhuiyi Technology

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu

Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding

GitHub repo: roformer

Rotary Position Embedding (RoPE) encodes position information by rotating query and key vectors according to their absolute positions, so that relative positions emerge naturally from the properties of the rotation. RoPE provided better extrapolation to longer sequences than previous position encoding methods while being computationally efficient. It was adopted by influential models including PaLM, LLaMA, and many other modern LLMs, becoming a preferred position encoding technique.
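
A sketch of the rotation for one vector, using the half-split pairing common in open-source implementations (the paper's interleaved pairing is equivalent up to a permutation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)    # per-pair rotation speeds
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=8)
q3, q7 = rope(q, 3), rope(q, 7)
# Dot products between rotated queries and keys depend only on the
# relative offset (here 7 - 3), which is what makes RoPE attractive.
```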

2021
LoRA: Low-Rank Adaptation of Large Language Models at Microsoft

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

Paper: LoRA: Low-Rank Adaptation of Large Language Models

GitHub repo: LoRA

LoRA enabled efficient fine-tuning of large language models by training low-rank decomposition matrices that are added to frozen pre-trained weights. This reduced trainable parameters by 10,000x and memory requirements by 3x while maintaining or exceeding full fine-tuning performance. LoRA made it practical to customize large models for specific tasks with limited compute resources, democratizing access to fine-tuning and enabling rapid adaptation of foundation models.
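
The forward pass in miniature: the frozen weight W is augmented by a trainable low-rank product, scaled by α/r, and the zero initialization of B makes the adapter a no-op at the start of training (as in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8

W = rng.normal(size=(d, d)) * 0.02     # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init

def lora_forward(x):
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

x = rng.normal(size=d)
y = lora_forward(x)                    # equals x @ W.T until B is trained
```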

2022
InstructGPT at OpenAI

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Paper: Training Language Models to Follow Instructions with Human Feedback

InstructGPT fine-tuned GPT-3 using supervised learning on human-written demonstrations followed by reinforcement learning from human feedback. Despite having 100x fewer parameters, InstructGPT outputs were preferred to GPT-3 outputs. The model showed improvements in truthfulness, helpfulness, and reduced toxicity. InstructGPT demonstrated that alignment with human preferences through RLHF was crucial for making language models useful and safe, establishing the approach used in ChatGPT.

2022
Chain-of-Thought Prompting at Google Research

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain-of-thought prompting enabled language models to solve complex reasoning tasks by generating intermediate reasoning steps before arriving at final answers. Simply adding a few examples with reasoning chains dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. This technique revealed emergent reasoning capabilities in large models and demonstrated that prompting strategies could unlock latent abilities without additional training.
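
The technique is purely a prompting pattern. A sketch built around the paper's canonical exemplar:

```python
# Few-shot chain-of-thought prompt: the exemplar answer spells out its
# reasoning, so the model imitates that style on the new question.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought
6 more, how many apples do they have?
A:"""
# A sufficiently large model tends to complete with intermediate steps:
# "The cafeteria had 23 apples. They used 20, leaving 3. They bought 6
#  more, so 3 + 6 = 9. The answer is 9."
```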

2022
FlashAttention at Stanford University, University at Buffalo

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

GitHub repo: flash-attention

FlashAttention optimized the attention mechanism by accounting for GPU memory hierarchy, using tiling to reduce data movement between GPU memory levels. This IO-aware algorithm achieved exact attention with significantly reduced memory usage and 2-4x speedup compared to standard implementations. FlashAttention enabled training transformers with much longer context lengths and became widely adopted, fundamentally improving the efficiency of transformer models.

2022
Constitutional AI: Harmlessness from AI Feedback at Anthropic

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Paper: Constitutional AI: Harmlessness from AI Feedback

Constitutional AI introduced a method for training harmless AI assistants using AI-generated feedback based on a set of principles (a 'constitution') rather than relying solely on human feedback. The model critiques and revises its own responses according to constitutional principles, then learns from these self-improvements. This approach reduced reliance on human labelers for harmlessness training while making the values guiding AI behavior more transparent and debuggable.

2023
Direct Preference Optimization (DPO) at Stanford University

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

GitHub repo: direct-preference-optimization

DPO simplified preference learning by directly optimizing language models on human preferences without requiring a separate reward model or reinforcement learning. It reformulated RLHF as a classification problem over preference pairs, making training more stable and efficient. DPO achieved comparable or better results than RLHF while being simpler to implement and tune, becoming a popular alternative for aligning language models with human preferences.
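
The loss on a single preference pair, sketched from the paper's formula (the log-probabilities here are placeholder scalars; in practice they are summed token log-probs from the policy and a frozen reference model):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # same for rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))       # -log sigmoid(margin)

print(dpo_loss(-10.0, -12.0, -11.0, -11.5))   # loss shrinks as margin grows
```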

2023
QLoRA: Efficient Fine-Tuning of Quantized LLMs at University of Washington

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Paper: QLoRA: Efficient Finetuning of Quantized LLMs

GitHub repo: qlora

QLoRA combined quantization with LoRA to enable fine-tuning of extremely large models on consumer hardware. It quantized the base model to 4-bit precision while using LoRA adapters in higher precision, maintaining full fine-tuning performance. QLoRA made it possible to fine-tune a 65B parameter model on a single GPU with 48GB memory, dramatically democratizing access to fine-tuning large language models and enabling researchers with limited resources to customize state-of-the-art models.

2024
Mixture-of-Experts (MoE) and Test-Time Compute Scaling at Mistral AI, xAI, DeepSeek

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Paper: Mixtral of Experts

Modern Mixture-of-Experts architectures like Mixtral, Grok, and DeepSeek-V2 combined sparse routing with test-time compute scaling, allowing models to dynamically allocate computation based on task difficulty. These architectures activated only a subset of parameters per token while maintaining large total capacity, achieving better performance-per-compute ratios. The combination with test-time scaling, where models use more computation for harder problems, represented a shift toward more efficient and adaptive AI systems.

2024
Layer Dropping and Progressive Pruning (TrimLLM) at Northeastern University, Indiana University Bloomington, University of Connecticut, University of Massachusetts Dartmouth, North Carolina State University

Lei Lu, Zhepeng Wang, Runyu Peng, Mengbing Wang, Fangyi Zhu, Zilong Wang, Hong Xu, Shangguang Wang

Paper: TrimLLM: Progressive Layer Dropping for Efficient LLM Inference

Layer dropping and progressive pruning techniques enabled efficient inference by selectively skipping or removing transformer layers based on input characteristics or layer importance. Research showed that many layers in large language models are redundant for certain tasks, and adaptive layer selection could maintain performance while reducing computation. These techniques became important for deploying large models in resource-constrained environments and improving inference efficiency.

2024
Multimodal Secure Alignment at Carnegie Mellon University, University of Washington

Xuguang Wang, Xin Eric Wang

Paper: Defending Against Jailbreak Attacks in Multimodal Language Models

Multimodal secure alignment addresses unique safety challenges when language models process multiple modalities (text, images, audio). Research revealed that multimodal models could be more vulnerable to jailbreaks through adversarial images or cross-modal attacks. New alignment techniques were developed to ensure consistent safety behavior across modalities, including modality-specific safety layers and cross-modal consistency checking. This work became critical as vision-language models like GPT-4V and Gemini became widely deployed.

2025
Chain-of-Thought Monitorability at Google DeepMind, Anthropic

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah

Paper: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

This work reframed chain-of-thought (CoT) safety monitoring around monitorability rather than faithfulness, distinguishing CoT-as-rationalization from CoT-as-computation. By making harmful behaviors require multi-step reasoning, the authors forced models to expose their plans and showed that CoT monitoring can detect severe risks unless attackers receive substantial assistance. The paper also offered stress-testing guidelines, concluding that CoT monitoring remains a valuable, if imperfect, layer of defense that warrants active protection and continual evaluation.

2025
Critical Representation Fine-Tuning (CRFT) at Zhejiang University, Alibaba Cloud Computing

Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye

Paper: Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning

CRFT extends Representation Fine-Tuning by identifying "critical" hidden representations that aggregate or gate reasoning signals, then editing them in a low-rank subspace while freezing the base LLaMA or Mistral weights. Information-flow analysis selects these high-leverage states, enabling large CoT accuracy gains across eight arithmetic and commonsense benchmarks as well as sizeable one-shot improvements. The work highlights that representation-level PEFT can unlock better reasoning without touching most model parameters.

2025
Mechanistic OOCR Steering Vectors at Massachusetts Institute of Technology, Independent Researchers

Atticus Wang, Joshua Engels, Oliver Clive-Griffin, Senthooran Rajamanoharan, Neel Nanda

Paper: Simple Mechanistic Explanations for Out-Of-Context Reasoning

This study dissects out-of-context reasoning (OOCR) and finds that many reported cases arise because LoRA fine-tuning effectively adds a constant steering vector that pushes models toward latent concepts. By extracting or directly training such steering vectors, the authors reproduce OOCR across risky/safe decision, function, location, and backdoor benchmarks, showing that unconditional steering can even implement conditional behaviors. The results provide a simple mechanistic account of why fine-tuned LLMs can generalize far beyond their training distribution and highlight the alignment implications of steering-vector interventions.

2025
Continuous Thought Machines (CTM) at Sakana AI

Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

Paper: Continuous Thought Machines

GitHub repo: ctm

The Continuous Thought Machine (CTM) introduces a neural network architecture that integrates neuron-level temporal processing and neural synchronization, reintroducing neural timing as a foundational element in artificial intelligence. Unlike standard neural networks, which ignore the complexity of individual neurons, the CTM makes neural dynamics its core representation through two key innovations: neuron-level temporal processing, where each neuron uses unique weight parameters to process the history of its incoming signals, and neural synchronization used as a latent representation. The CTM performs strongly across diverse tasks, including ImageNet-1K classification, 2D maze solving, sorting, parity computation, question answering, and reinforcement learning. It also naturally supports adaptive computation, stopping early on simpler tasks and continuing to process more challenging instances.

2026
Constitutional Classifiers++ at Anthropic

Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, Alek Dimitriev, Rob Gilson, Logan Howard, Yijin Hua, Jared Kaplan, Jan Leike, Mu Lin, Christopher Liu, Vladimir Mikulik, Rohit Mittapalli, Clare O'Hara, Jin Pan, Nikhil Saxena, Alex Silverstein, Yue Song, Xunjie Yu, Giulio Zhou, Ethan Perez, Mrinank Sharma

Paper: Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Constitutional Classifiers++ delivers production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. The system combines exchange classifiers that evaluate model responses in full conversational context, a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers, and efficient linear probe classifiers ensembled with external classifiers. These techniques achieve a 40x computational cost reduction compared to baseline exchange classifiers while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, the work demonstrates strong protection against universal jailbreaks.