History of Machine Learning & LLMs

Disclaimer: I used various LLMs to generate some of the data for this timeline.

The Impact score is a citation-driven editorial score: each entry's relative citation rank is combined with a small recency bonus. Entries below the default cutoff are hidden as provisional.

Min Impact score: 45
1957
Perceptron at Cornell Aeronautical Laboratory

Frank Rosenblatt

Impact 47 · 2,525 citations

Paper: The Perceptron: A Perceiving and Recognizing Automaton

The Perceptron was the first model that could learn the weights defining categories given examples from each category. It established the foundation for artificial neural networks by introducing a learning algorithm that could automatically adjust connection weights. The perceptron demonstrated that machines could learn from experience, marking a fundamental breakthrough in machine learning.
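
Its mistake-driven update rule fits in a few lines. The sketch below is a minimal illustration, assuming ±1 labels, a fixed learning rate, and a bias folded in as an always-on input; the function names are ours, not Rosenblatt's.

```python
def predict(w, x):
    # Sign of the weighted sum decides the category.
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1

def train_perceptron(samples, n_features, epochs=20, lr=1.0):
    w = [0.0] * (n_features + 1)          # last weight acts as the bias
    for _ in range(epochs):
        for x, y in samples:
            xb = list(x) + [1.0]          # append the always-on bias input
            if predict(w, xb) != y:       # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, xb)]
    return w

# Linearly separable toy data: an OR-like concept.
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data, n_features=2)
```

On separable data like this, the update rule is guaranteed to converge to a separating set of weights.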

1980
Neocognitron at NHK Science & Technical Research Laboratories

Kunihiko Fukushima

Impact 60 · 6,200 citations

Paper: Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position

The Neocognitron was a hierarchical, multilayered neural network inspired by the visual cortex. It introduced the concepts of S-cells (simple cells) and C-cells (complex cells) arranged in a hierarchy, allowing for position-invariant pattern recognition. This architecture laid the groundwork for modern convolutional neural networks and demonstrated that local feature extraction combined with spatial pooling could achieve robust visual recognition.

1986
Backpropagation at University of California San Diego, Carnegie Mellon University, University of Toronto

David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams

Impact 81 · 29,607 citations

Paper: Learning Representations by Back-propagating Errors

Backpropagation provided an efficient method for training multi-layer neural networks by computing gradients through the chain rule. This algorithm enabled the training of deep networks by propagating error signals backwards through layers, allowing hidden units to learn internal representations. Backpropagation became the workhorse of neural network training and remains fundamental to modern deep learning.
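
The chain-rule bookkeeping can be made concrete on a one-hidden-unit network y = w2 * tanh(w1 * x) with squared error, checked against numerical gradients. The notation is modern shorthand, not the paper's.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)     # hidden activation
    y = w2 * h                # output
    return h, y

def backward(w1, w2, x, t):
    h, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - t)             # dL/dy for L = (y - t)^2
    dL_dw2 = dL_dy * h                # output-weight gradient
    dL_dh = dL_dy * w2                # error propagated back to hidden unit
    dL_dw1 = dL_dh * (1 - h * h) * x  # tanh'(z) = 1 - tanh(z)^2
    return dL_dw1, dL_dw2

# Verify against central-difference numerical gradients.
w1, w2, x, t = 0.5, -0.3, 0.8, 1.0
g1, g2 = backward(w1, w2, x, t)
eps = 1e-6
loss = lambda a, b: (forward(a, b, x)[1] - t) ** 2
num1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
num2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)
```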

1997
Long Short-Term Memory (LSTM) at Technische Universität München

Sepp Hochreiter, Jürgen Schmidhuber

Impact 94 · 139,920 citations

Paper: Long Short-Term Memory

LSTM addressed the vanishing gradient problem in recurrent neural networks by introducing memory cells with gating mechanisms. The architecture uses input, output, and forget gates to control information flow, enabling networks to learn long-term dependencies. LSTM became the dominant architecture for sequence modeling tasks including speech recognition, machine translation, and time series prediction before the transformer era.
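
The gate equations can be sketched with scalar states (real layers use vectors and weight matrices; the parameter values below are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # p holds (w_x, w_h, b) triples for gates i, f, o and candidate g.
    i = sigmoid(p['i'][0] * x + p['i'][1] * h_prev + p['i'][2])    # input gate
    f = sigmoid(p['f'][0] * x + p['f'][1] * h_prev + p['f'][2])    # forget gate
    o = sigmoid(p['o'][0] * x + p['o'][1] * h_prev + p['o'][2])    # output gate
    g = math.tanh(p['g'][0] * x + p['g'][1] * h_prev + p['g'][2])  # candidate
    c = f * c_prev + i * g        # gated cell update: f controls what persists
    h = o * math.tanh(c)          # exposed hidden state
    return h, c

params = {k: (0.5, 0.1, 0.0) for k in 'ifog'}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

The additive cell update `f * c_prev + i * g` is what keeps gradients from vanishing over long spans: when f is near 1, error flows through the cell unchanged.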

1998
Convolutional Neural Networks (LeNet) at AT&T Bell Laboratories

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner

Impact 91 · 72,400 citations

Paper: Gradient-Based Learning Applied to Document Recognition

LeNet introduced a practical convolutional neural network architecture for document recognition. It combined convolutional layers for local feature extraction, pooling for spatial invariance, and fully connected layers for classification. This architecture demonstrated that CNNs could be trained end-to-end using backpropagation and achieved state-of-the-art results on handwritten digit recognition, establishing the blueprint for modern computer vision systems.

2003
The Neural Probabilistic Language Model at Université de Montréal

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin

Impact 76 · 18,500 citations

Paper: A Neural Probabilistic Language Model

The Neural Probabilistic Language Model addressed the curse of dimensionality in language modeling by learning distributed representations for words. It introduced the idea that similar words would have similar vector representations, allowing the model to generalize to unseen word sequences. This foundational work pioneered the use of neural networks for language modeling and word embeddings, directly inspiring Word2Vec and modern language models.

2009
ImageNet Dataset at Princeton University

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei

Impact 90 · 58,300 citations

Paper: ImageNet: A Large-Scale Hierarchical Image Database

ImageNet created a large-scale dataset with over 14 million labeled images across thousands of categories, organized hierarchically using WordNet. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the premier benchmark for computer vision. ImageNet's scale and diversity enabled the training of deep neural networks and catalyzed the deep learning revolution, particularly with AlexNet's breakthrough in 2012.

2010
Xavier/Glorot Initialization at Université de Montréal

Xavier Glorot, Yoshua Bengio

Impact 79 · 25,800 citations

Paper: Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier initialization provided a principled method for initializing neural network weights to maintain consistent variance of activations and gradients across layers. By scaling initial weights based on the number of input and output connections, it prevented vanishing or exploding gradients during training. This simple but crucial technique enabled the training of much deeper networks and remains a standard practice in deep learning.
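
A sketch of the uniform variant, with bound a = sqrt(6 / (fan_in + fan_out)) chosen so activation and gradient variances stay roughly constant across layers:

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng=random):
    # Draw each weight from U(-a, a); variance is a^2/3 = 2/(fan_in + fan_out).
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_uniform(256, 128)
```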

2010
Rectified Linear Unit (ReLU) Activation at University of Toronto

Vinod Nair, Geoffrey E. Hinton

Impact 63 · 6,800 citations

Paper: Rectified Linear Units Improve Restricted Boltzmann Machines

ReLU introduced a simple non-saturating activation function f(x) = max(0, x) that addressed the vanishing gradient problem of sigmoid and tanh activations. ReLU enabled faster training, reduced computational cost, and induced sparsity in neural networks. Despite its simplicity, ReLU became the default activation function for deep neural networks and enabled the training of much deeper architectures.

2012
AlexNet at University of Toronto

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Impact 96 · 150,801 citations

Paper: ImageNet Classification with Deep Convolutional Neural Networks

GitHub repo: cuda-convnet2

AlexNet won the ImageNet 2012 competition with a significant margin, demonstrating that deep convolutional networks trained with GPUs could dramatically outperform traditional computer vision methods. The architecture combined ReLU activations, dropout regularization, data augmentation, and GPU training. AlexNet's success marked the beginning of the deep learning era and sparked intense interest in neural networks across academia and industry.

2012
Dropout at University of Toronto

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov

Impact 89 · 56,200 citations

Paper: Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Dropout introduced a powerful regularization technique by randomly dropping units during training, preventing co-adaptation of features. This simple method significantly reduced overfitting in deep neural networks by training an ensemble of exponentially many sub-networks. Dropout became a standard regularization technique and enabled the training of larger networks without excessive overfitting.
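
A sketch of the inverted-dropout formulation used in practice (the paper instead scales weights at test time; the two are equivalent in expectation):

```python
import random

def dropout(activations, p, rng=random, training=True):
    # Zero each unit with probability p; scale survivors by 1/(1-p)
    # so the expected activation is unchanged and no test-time
    # rescaling is needed.
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
```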

2013
Word2Vec at Google

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Impact 87 · 45,200 citations

Paper: Efficient Estimation of Word Representations in Vector Space

GitHub repo: word2vec

Word2Vec introduced efficient methods (Skip-gram and CBOW) for learning dense vector representations of words from large corpora. These embeddings captured semantic and syntactic relationships, enabling vector arithmetic like 'king' - 'man' + 'woman' ≈ 'queen'. Word2Vec revolutionized natural language processing by providing a scalable way to represent words as continuous vectors, becoming foundational for modern NLP.
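
The analogy arithmetic can be shown mechanically; the 2-d vectors below are hand-made for illustration only, not trained embeddings.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # Cosine similarity: normalized dot product.
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb)

# Toy 2-d space: dimension 0 ~ "royalty", dimension 1 ~ "gender".
emb = {'king': [1.0, 1.0], 'man': [0.0, 1.0],
       'woman': [0.0, -1.0], 'queen': [1.0, -1.0]}

target = add(sub(emb['king'], emb['man']), emb['woman'])
best = max(emb, key=lambda w: cosine(emb[w], target))
```

In real embeddings the input words are usually excluded from the nearest-neighbor search; in this toy space 'queen' wins outright.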

2013
Variational Autoencoder (VAE) at University of Amsterdam

Diederik P. Kingma, Max Welling

Impact 83 · 32,100 citations

Paper: Auto-Encoding Variational Bayes

VAE introduced a probabilistic approach to learning latent representations by combining variational inference with neural networks. It learns a distribution over latent codes rather than deterministic encodings, enabling both efficient inference and generation. VAE provided a principled framework for generative modeling and became influential in unsupervised learning, representation learning, and generative AI.

2014
Generative Adversarial Network (GAN) at Université de Montréal

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Impact 93 · 88,791 citations

Paper: Generative Adversarial Networks

GAN introduced a game-theoretic framework where a generator network learns to create realistic data by competing against a discriminator network. This adversarial training process enabled the generation of highly realistic images without requiring explicit modeling of probability distributions. GANs revolutionized generative modeling and spawned numerous applications in image synthesis, style transfer, and data augmentation.

2014
Adam Optimizer at OpenAI, University of Toronto

Diederik P. Kingma, Jimmy Ba

Impact 99 · 215,000 citations

Paper: Adam: A Method for Stochastic Optimization

Adam combined the benefits of AdaGrad and RMSProp by computing adaptive learning rates for each parameter using estimates of first and second moments of gradients. It included bias correction terms and proved robust across a wide range of problems with minimal hyperparameter tuning. Adam became the most widely used optimizer in deep learning due to its efficiency, ease of use, and strong empirical performance.
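
The per-parameter update follows directly from the moment estimates. A minimal single-parameter sketch with the paper's default hyperparameters, applied to the toy objective f(theta) = theta^2:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2, gradient 2*theta, starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```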

2014
Sequence-to-Sequence Learning at Google

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

Impact 80 · 26,800 citations

Paper: Sequence to Sequence Learning with Neural Networks

GitHub repo: seq2seq

Seq2Seq introduced an end-to-end framework for sequence transduction using an encoder-decoder architecture with LSTMs. The encoder maps variable-length input sequences to fixed-size representations, which the decoder transforms into variable-length output sequences. This architecture unified many NLP tasks under a single framework and achieved breakthrough results in machine translation, establishing neural approaches as state-of-the-art.

2014
Attention Mechanism at Université de Montréal

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Impact 88 · 47,200 citations

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau attention addressed the bottleneck in sequence-to-sequence models by allowing the decoder to focus on different parts of the input sequence at each decoding step. This attention mechanism computed context vectors as weighted sums of encoder hidden states, where weights were learned based on relevance. Attention became a fundamental building block of modern NLP systems and directly inspired the transformer architecture.
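
The weighted-sum mechanics can be sketched directly. For brevity the score below is a dot product; Bahdanau's original scoring function is additive (a small MLP over the decoder and encoder states).

```python
import math

def softmax(xs):
    mx = max(xs)                       # subtract max for numerical stability
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(decoder_state, encoder_states):
    # One relevance score per source position, turned into weights,
    # then a weighted sum of encoder states forms the context vector.
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(len(encoder_states[0]))]
    return context, weights

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder hidden states
ctx, w = attend([1.0, 0.0], enc)
```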

2014
GloVe Word Embeddings at Stanford University

Jeffrey Pennington, Richard Socher, Christopher D. Manning

Impact 86 · 40,500 citations

Paper: GloVe: Global Vectors for Word Representation

GitHub repo: GloVe

GloVe combined global matrix factorization with local context window methods for learning word embeddings. It trained on aggregated word-word co-occurrence statistics to produce vectors with meaningful linear substructures. GloVe provided an alternative to Word2Vec with strong performance on word analogy and similarity tasks, and its pre-trained vectors became widely used in NLP applications.

2014
Neural Turing Machine at Google DeepMind

Alex Graves, Greg Wayne, Ivo Danihelka

Impact 56 · 3,850 citations

Paper: Neural Turing Machines

Neural Turing Machines extended neural networks by coupling them to external memory resources accessed through attention mechanisms. The entire system was differentiable end-to-end, allowing gradient-based training. NTMs demonstrated that neural networks could learn simple algorithms like copying, sorting, and associative recall from examples alone, showing that neural networks could exhibit more algorithmic and programmable behavior.

2015
Batch Normalization at Google

Sergey Ioffe, Christian Szegedy

Impact 92 · 72,400 citations

Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch Normalization normalized layer inputs across mini-batches, stabilizing training by reducing internal covariate shift. It enabled much higher learning rates, reduced sensitivity to initialization, and acted as a regularizer. Batch normalization dramatically accelerated training and became a standard component in deep networks, enabling the training of very deep architectures that were previously difficult to optimize.
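
For a single feature, the training-time computation is a normalize-then-affine step. A minimal per-feature sketch (ignoring the running statistics used at inference):

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize one feature across the mini-batch, then apply the
    # learned scale (gamma) and shift (beta).
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n   # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
```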

2015
Residual Networks (ResNet) at Microsoft Research

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Impact 100 · 298,566 citations

Paper: Deep Residual Learning for Image Recognition

GitHub repo: deep-residual-networks

ResNet introduced skip connections that allowed gradients to flow directly through networks by learning residual mappings. This simple architectural change enabled the training of networks with hundreds or even thousands of layers without degradation problems. ResNet won ImageNet 2015 and demonstrated that very deep networks could be effectively trained, fundamentally changing how we design neural network architectures.
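
The core idea is a one-liner: add the block's input back to its output, so the layers learn a residual F(x) while the identity path carries gradients unchanged. A sketch with a stand-in transform in place of the paper's conv-BN-ReLU stack:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    # Output is ReLU(F(x) + x): the skip connection adds the input back.
    return relu([f + xi for f, xi in zip(transform(x), x)])

# With a zero transform the block reduces to the identity (for non-negative
# inputs), which is why stacking many blocks does not degrade the signal.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0, 0.0])
```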

2016
Layer Normalization at University of Toronto

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

Impact 77 · 20,100 citations

Paper: Layer Normalization

Layer Normalization normalized inputs across features for each example independently, unlike batch normalization which normalized across the batch dimension. This made it particularly effective for recurrent neural networks and sequences of varying length. Layer normalization stabilized hidden state dynamics in RNNs and later became the standard normalization technique in transformer architectures.
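
The computation mirrors batch normalization but over a different axis: statistics come from one example's features, so it works at any batch size (including 1) and for variable-length sequences. A minimal sketch:

```python
import math

def layer_norm(features, gamma=1.0, beta=0.0, eps=1e-5):
    # Mean and variance are computed over this single example's features;
    # no batch statistics are involved.
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    return [gamma * (f - mean) / math.sqrt(var + eps) + beta
            for f in features]

out = layer_norm([2.0, 4.0, 6.0, 8.0])
```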

2016
Subword Units (BPE): Solving the Rare Word Problem at University of Edinburgh

Rico Sennrich, Barry Haddow, Alexandra Birch

Impact 71 · 14,800 citations

Paper: Neural Machine Translation of Rare Words with Subword Units

GitHub repo: subword-nmt

Byte-Pair Encoding (BPE) adapted a data compression algorithm for neural machine translation, enabling open-vocabulary learning by breaking words into subword units. This solved the rare word problem by representing infrequent words as sequences of common subwords. BPE became the standard tokenization approach for language models, enabling models to handle any word while maintaining reasonable vocabulary sizes, and is used in GPT, BERT, and most modern LLMs.
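
The learning procedure is a greedy loop: count adjacent symbol pairs, merge the most frequent, repeat. A minimal sketch on a toy vocabulary in the paper's style (note the naive `str.replace` can over-merge when symbols share boundaries; the reference implementation uses boundary-aware regexes):

```python
from collections import Counter

def most_frequent_pair(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def learn_bpe(vocab, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(vocab)
        if pair is None:
            break
        merged = ''.join(pair)
        vocab = {word.replace(' '.join(pair), merged): f
                 for word, f in vocab.items()}
        merges.append(pair)
    return merges, vocab

# Words are space-separated symbol sequences with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
merges, vocab = learn_bpe(vocab, num_merges=3)
```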

2017
Transformer Architecture at Google

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Impact 98 · 209,982 citations

Paper: Attention Is All You Need

GitHub repo: tensor2tensor

The Transformer replaced recurrence and convolutions entirely with self-attention mechanisms, processing sequences in parallel rather than sequentially. It introduced multi-head attention, positional encodings, and position-wise feedforward layers within an encoder-decoder structure. The Transformer achieved state-of-the-art translation results while being more parallelizable and requiring significantly less training time. This architecture became the foundation for modern large language models and revolutionized NLP.
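
The core operation, scaled dot-product attention softmax(Q K^T / sqrt(d_k)) V, can be sketched over toy 2-d token vectors (single head, no learned projections, plain lists for clarity):

```python
import math

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key, scaled by sqrt(d_k) to keep logits well-behaved.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)                     # attention weights
        out.append([sum(wj * v[c] for wj, v in zip(w, V))
                    for c in range(len(V[0]))])  # weighted sum of values
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors; Q = K = V
Y = self_attention(X, X, X)
```

Every output position attends over all input positions at once, which is what makes the computation fully parallel across the sequence.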

2017
Reinforcement Learning from Human Feedback (RLHF) at OpenAI, UC Berkeley, DeepMind

Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Impact 55 · 3,250 citations

Paper: Deep Reinforcement Learning from Human Preferences

RLHF introduced a method for training RL agents using human preference comparisons rather than hand-crafted reward functions. Humans compared pairs of trajectory segments, and a reward model was trained to predict preferences. This reward model then guided policy optimization. RLHF scaled preference-based learning to complex tasks and later became crucial for aligning large language models with human values and intentions.

2017
Sparsely-Gated Mixture of Experts at Google

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Impact 50 · 3,050 citations

Paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Mixture of Experts introduced conditional computation where a gating network routes each input to a sparse subset of expert sub-networks. This enabled training models with orders of magnitude more parameters without proportional increases in computation. MoE demonstrated that model capacity could be dramatically increased through sparsity, achieving state-of-the-art results in language modeling and translation. This approach later influenced large-scale models like GPT-4.

2017
Proximal Policy Optimization (PPO) at OpenAI

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Impact 72 · 16,800 citations

Paper: Proximal Policy Optimization Algorithms

GitHub repo: baselines

PPO introduced a simpler and more stable policy gradient method by clipping the objective function to prevent excessively large policy updates. It combined the benefits of trust region methods with the simplicity of first-order optimization. PPO became the most widely used reinforcement learning algorithm due to its robustness, ease of implementation, and strong empirical performance across diverse tasks.
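
The clipped surrogate objective is small enough to show directly. A minimal single-action sketch in the paper's notation (probability ratio r, advantage A, clip range eps):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # L = min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A modest ratio passes through; a large ratio with positive advantage
# is clipped, capping the incentive to move the policy too far.
unclipped = ppo_clip_objective(1.1, advantage=1.0)
capped = ppo_clip_objective(2.0, advantage=1.0)
```

With a negative advantage the outer `min` keeps the more pessimistic term, so the objective never rewards large policy moves in either direction.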

2018
ELMo (Embeddings from Language Models) at Allen Institute for AI, University of Washington

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

Impact 74 · 17,800 citations

Paper: Deep Contextualized Word Representations

GitHub repo: bilm-tf

ELMo generated context-dependent word representations by using bidirectional LSTMs trained as language models. Unlike static embeddings, ELMo representations varied based on context, capturing polysemy and complex linguistic features. ELMo demonstrated the power of pre-training and fine-tuning, significantly improving performance across diverse NLP tasks. It was a crucial step toward modern contextualized language models and transfer learning in NLP.

2018
GPT (Generative Pre-Training) at OpenAI

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

Impact 59 · 6,100 citations

Paper: Improving Language Understanding by Generative Pre-Training

GitHub repo: finetune-transformer-lm

GPT introduced a two-stage approach: unsupervised pre-training of a transformer language model on large text corpora, followed by supervised fine-tuning on specific tasks. This demonstrated that language models could learn general representations useful across many tasks. GPT showed that pre-training could significantly reduce the labeled data required for downstream tasks, establishing the pre-train-then-fine-tune paradigm that dominated subsequent NLP research.

2018
BERT (Bidirectional Encoder Representations from Transformers) at Google

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Impact 97 · 152,370 citations

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

GitHub repo: bert

BERT pre-trained bidirectional transformers using masked language modeling and next sentence prediction. Unlike previous unidirectional models, BERT jointly conditioned on both left and right context in all layers. BERT achieved state-of-the-art results across eleven NLP tasks and demonstrated that deeply bidirectional pre-training was crucial for language understanding. BERT became the foundation for numerous downstream applications and variants.

2018
Mixed Precision Training at NVIDIA

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu

Impact 57 · 4,700 citations

Paper: Mixed Precision Training

GitHub repo: apex

Micikevicius et al. showed how to safely train deep networks using half-precision (FP16) arithmetic while preserving full-precision accuracy. By keeping FP32 master weights, accumulating gradients in FP32, and using loss scaling to avoid underflow, they demonstrated 2–3× speedups on NVIDIA Tensor Cores without sacrificing convergence. Mixed precision became the standard recipe for large-scale transformer training, enabling today's models to fit within GPU memory budgets.
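
The loss-scaling trick can be illustrated by simulating FP16 underflow with a flush-to-zero threshold. This is a simplification (real FP16 represents subnormals down to 2^-24, below which values round to zero), and the gradient values are illustrative.

```python
FP16_TINY = 2.0 ** -24          # smallest positive FP16 subnormal magnitude

def fp16_flush(x):
    # Crude model of FP16 underflow: magnitudes below the smallest
    # representable value are lost to zero.
    return 0.0 if abs(x) < FP16_TINY else x

def scaled_grad(true_grad, scale):
    g16 = fp16_flush(true_grad * scale)   # backprop happens in FP16
    return g16 / scale                    # unscale before the FP32 update

g = 1e-8                                  # would vanish in plain FP16
lost = scaled_grad(g, scale=1.0)          # flushed to zero
kept = scaled_grad(g, scale=1024.0)       # survives thanks to loss scaling
```

Scaling by a power of two is exact in floating point, so unscaling recovers the true gradient without rounding error.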

2019
GPT-2 at OpenAI

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

Impact 69 · 12,400 citations

Paper: Language Models are Unsupervised Multitask Learners

GitHub repo: gpt-2

GPT-2 scaled up the original GPT to 1.5 billion parameters and trained on a larger, more diverse dataset. It demonstrated that language models could perform many tasks zero-shot without fine-tuning by simply conditioning on appropriate prompts. GPT-2 showed strong performance on diverse tasks including translation, summarization, and question answering, suggesting that with sufficient scale and data, language models naturally learn multitask capabilities.

2019
T5 (Text-to-Text Transfer Transformer) at Google

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Impact 78 · 21,500 citations

Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

GitHub repo: text-to-text-transfer-transformer

T5 unified all NLP tasks into a text-to-text format where both inputs and outputs are text strings. It systematically explored transfer learning techniques including pre-training objectives, architectures, datasets, and fine-tuning methods. T5's encoder-decoder architecture and comprehensive evaluation provided insights into what makes transfer learning effective. The unified framework simplified multi-task learning and became influential for instruction-following models.

2020
Scaling Laws for Neural Language Models at OpenAI

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Impact 58 · 5,400 citations

Paper: Scaling Laws for Neural Language Models

This work empirically demonstrated that language model performance scales as power-laws with model size, dataset size, and compute budget. The research showed predictable relationships between these factors and suggested optimal allocation strategies. These scaling laws provided quantitative guidance for training large models and predicted that simply scaling up models would continue to yield improvements, influencing subsequent investment in large-scale model development.

2020
GPT-3 at OpenAI

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Impact 82 · 30,200 citations

Paper: Language Models are Few-Shot Learners

GPT-3 scaled transformers to 175 billion parameters, demonstrating that language models could perform diverse tasks with few-shot, one-shot, or zero-shot learning from prompts alone. It showed impressive performance on translation, question-answering, arithmetic, and novel word usage without gradient updates. GPT-3 revealed that with sufficient scale, language models develop broad capabilities and sparked widespread interest in large language models and prompt engineering.

2020
Retrieval-Augmented Generation (RAG) at Facebook AI


Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

Impact 65 · 7,900 citations

Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

RAG combined parametric language models with non-parametric document retrieval, allowing generation to be grounded in an external knowledge index rather than only in model weights. This substantially improved factual QA and made it practical to update a system's knowledge without full retraining. The paper became the conceptual template for modern retrieval-augmented LLM systems.

ZeRO (Zero Redundancy Optimizer) at Microsoft


Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

Impact 54 · 3,200 citations

Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

GitHub repo: DeepSpeed

ZeRO eliminated memory redundancies in data-parallel distributed training by partitioning optimizer states, gradients, and parameters across devices rather than replicating them. ZeRO enabled training models with trillions of parameters by dramatically reducing per-device memory requirements while maintaining computational efficiency. This optimization became crucial for training large language models and is implemented in DeepSpeed, enabling the scale of models like GPT-3 and beyond.

RoFormer: Rotary Position Embedding (RoPE) at Zhuiyi Technology


Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu

Impact 50 · 2,800 citations

Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding

GitHub repo: roformer

Rotary Position Embedding (RoPE) encodes position information by rotating word embeddings based on their absolute positions, while naturally encoding relative position information through the rotation properties. RoPE provided better extrapolation to longer sequences than previous position encoding methods while being computationally efficient. It was adopted by influential models including PaLM, LLaMA, and many other modern LLMs, becoming a preferred position encoding technique.
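
The mechanism can be shown on a single 2-d pair: rotate each vector by an angle proportional to its position, and dot products between rotated vectors depend only on the relative offset. A minimal sketch with an arbitrary base angle theta:

```python
import math

def rotate(vec, m, theta=1.0):
    # Rotate a 2-d embedding pair by angle m * theta (m = position).
    x, y = vec
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.5), (0.3, -0.2)
# Same relative offset (2) at different absolute positions gives the
# same query-key dot product, since rotations compose: R_m^T R_n = R_{n-m}.
d1 = dot(rotate(q, 5), rotate(k, 3))
d2 = dot(rotate(q, 9), rotate(k, 7))
```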

CLIP at OpenAI


Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Impact 86 · 39,800 citations

Paper: Learning Transferable Visual Models From Natural Language Supervision

GitHub repo: CLIP

CLIP trained image and text encoders contrastively on large-scale web image-caption pairs, showing that natural language supervision could produce zero-shot transferable visual representations. It was a major step toward modern multimodal foundation models and strongly influenced vision-language pretraining, retrieval, and image generation workflows. CLIP also helped popularize prompt-based evaluation outside pure NLP.

LoRA: Low-Rank Adaptation of Large Language Models at Microsoft


Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

Impact 67 · 8,900 citations

Paper: LoRA: Low-Rank Adaptation of Large Language Models

GitHub repo: LoRA

LoRA enabled efficient fine-tuning of large language models by training low-rank decomposition matrices that are added to frozen pre-trained weights. This reduced trainable parameters by 10,000x and memory requirements by 3x while maintaining or exceeding full fine-tuning performance. LoRA made it practical to customize large models for specific tasks with limited compute resources, democratizing access to fine-tuning and enabling rapid adaptation of foundation models.
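
The forward pass can be sketched with tiny matrices: the frozen weight's output plus a scaled low-rank correction (alpha / r) * B A x. Shapes and values below are illustrative, not from the paper.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0, r=1):
    base = matvec(W, x)                  # frozen pretrained path
    delta = matvec(B, matvec(A, x))      # low-rank path: B is d x r, A is r x d
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]             # frozen 2x2 weight
A = [[0.1, 0.2]]                         # rank r = 1
B = [[0.5], [-0.5]]
y = lora_forward(W, A, B, [1.0, 1.0])
```

In the paper B is initialized to zero, so the low-rank path contributes nothing at the start and fine-tuning begins exactly at the pretrained model's behavior; only A and B receive gradients.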

Chinchilla Scaling Laws at DeepMind


Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

Impact 70 · 9,900 citations

Paper: Training Compute-Optimal Large Language Models

Chinchilla revised the original scaling-laws story by showing that many frontier models were undertrained relative to their parameter count and that, for a fixed compute budget, smaller models trained on substantially more tokens perform better. This changed how practitioners think about optimal model/data allocation and heavily influenced subsequent LLM training recipes. It became one of the clearest examples of scaling-law results directly changing engineering strategy.

2022
FLAN Instruction Tuning at Google Research


Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Impact 69 · 9,100 citations

Paper: Finetuned Language Models Are Zero-Shot Learners

GitHub repo: FLAN

FLAN showed that instruction tuning across a broad mixture of tasks can dramatically improve zero-shot and few-shot generalization, making pretrained models much better at following natural-language instructions without task-specific finetuning. It helped establish instruction tuning as a core post-training step for useful LLM assistants. Much of the later chat-assistant paradigm builds on this lesson.

2022
InstructGPT at OpenAI

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Impact 66 · 7,200 citations

Paper: Training Language Models to Follow Instructions with Human Feedback

InstructGPT fine-tuned GPT-3 using supervised learning on human-written demonstrations, followed by reinforcement learning from human feedback (RLHF). Labelers preferred outputs from the 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3, despite the former having over 100x fewer parameters. The model also showed improvements in truthfulness and helpfulness and reduced toxicity. InstructGPT demonstrated that aligning models with human preferences through RLHF was crucial for making language models useful and safe, establishing the approach later used in ChatGPT.
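The reward-modeling step of the RLHF pipeline can be sketched with its pairwise (Bradley-Terry) loss. This is a minimal scalar sketch, not OpenAI's implementation: the reward model is trained so the human-preferred response scores higher than the rejected one.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected); shrinks as the margin grows.

    Scalar rewards stand in for the reward model's outputs on the
    human-preferred (chosen) and rejected responses.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

Minimizing this over many comparison pairs yields the scalar reward that the subsequent RL step optimizes against.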

2022
2022
Chain-of-Thought Prompting at Google Research

Chain-of-Thought Prompting

Google Research

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Impact 63 · 6,700 citations

Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain-of-thought prompting enabled language models to solve complex reasoning tasks by generating intermediate reasoning steps before arriving at final answers. Simply adding a few examples with reasoning chains dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. This technique revealed emergent reasoning capabilities in large models and demonstrated that prompting strategies could unlock latent abilities without additional training.
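The technique is purely a matter of prompt construction, which can be sketched directly. The exemplar below paraphrases the paper's well-known tennis-ball example; the helper name is ours.

```python
# A few-shot demonstration that includes the intermediate reasoning,
# not just the final answer.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
    "balls each. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question):
    """Prepend a reasoning exemplar so the model imitates step-by-step work."""
    return COT_EXEMPLAR + "Q: " + question + "\nA:"

prompt = build_cot_prompt(
    "A bakery sold 14 muffins in the morning and 9 in the afternoon. "
    "How many muffins were sold in total?")
```

The model, conditioned on this prompt, tends to produce its own reasoning chain before the answer, which is where the accuracy gains come from.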

2022
2022
FlashAttention at Stanford University

FlashAttention

Stanford University

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Impact 54 · 3,100 citations

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

GitHub repo: flash-attention

FlashAttention optimized the attention mechanism by accounting for the GPU memory hierarchy, using tiling to reduce reads and writes between slow high-bandwidth memory (HBM) and fast on-chip SRAM. This IO-aware algorithm computed exact attention with significantly reduced memory usage and a 2-4x speedup over standard implementations. FlashAttention enabled training transformers with much longer context lengths and became widely adopted, fundamentally improving the efficiency of transformer models.
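The key numerical trick behind the tiling is the online (streaming) softmax, which can be sketched in a toy form. This is a single-query, scalar-value illustration of the idea, not the CUDA kernel: key/value blocks are processed one at a time while a running max and normalizer are maintained, so the full score matrix is never materialized.

```python
import math

def streaming_attention(scores, values, block=2):
    """Softmax-weighted sum of values, computed one block at a time."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running softmax normalizer
    acc = 0.0          # running weighted sum of values
    for i in range(0, len(scores), block):
        s_blk = scores[i:i + block]
        v_blk = values[i:i + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescale old partial sums to the new max
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

def naive_attention(scores, values):
    """Reference computation with the full score vector in memory."""
    w = [math.exp(s) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
```

Because the rescaling is exact, the streamed result matches the naive one up to floating-point error, which is why FlashAttention is exact attention rather than an approximation.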

2022
2022
Constitutional AI: Harmlessness from AI Feedback at Anthropic

Constitutional AI: Harmlessness from AI Feedback

Anthropic

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Impact 47 · 1,850 citations

Paper: Constitutional AI: Harmlessness from AI Feedback

Constitutional AI introduced a method for training harmless AI assistants using AI-generated feedback based on a set of principles (a 'constitution') rather than relying solely on human feedback. The model critiques and revises its own responses according to constitutional principles, then learns from these self-improvements. This approach reduced reliance on human labelers for harmlessness training while making the values guiding AI behavior more transparent and debuggable.
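The supervised phase of the recipe can be sketched schematically. Here `generate` stands in for a real model call, and the prompt wording is illustrative, not Anthropic's actual constitution.

```python
CRITIQUE_PROMPT = (
    "Identify any ways the response below is harmful, unethical, or "
    "misleading.\nResponse: {response}"
)
REVISE_PROMPT = (
    "Rewrite the response to fix the problems in the critique.\n"
    "Response: {response}\nCritique: {critique}"
)

def constitutional_revision(generate, response, rounds=1):
    """Run critique -> revise rounds; the revisions become SFT targets."""
    for _ in range(rounds):
        critique = generate(CRITIQUE_PROMPT.format(response=response))
        response = generate(
            REVISE_PROMPT.format(response=response, critique=critique))
    return response
```

The revised responses are then used as fine-tuning targets, and a later RLAIF phase replaces human preference labels with AI-generated ones.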

2023
2023

LLaMA at Meta AI

LLaMA

Meta AI

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

Impact 78 · 18,300 citations

Paper: LLaMA: Open and Efficient Foundation Language Models

LLaMA showed that carefully trained smaller foundation models could compete strongly with much larger systems, and its release catalyzed the open-weight LLM ecosystem. It accelerated research on fine-tuning, alignment, evaluation, and local deployment by giving the community a strong accessible base model family. In practice, it was a major inflection point for open LLM development.

2023
2023
Direct Preference Optimization (DPO) at Stanford University

Direct Preference Optimization (DPO)

Stanford University

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Impact 51 · 2,600 citations

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

GitHub repo: direct-preference-optimization

DPO simplified preference learning by directly optimizing language models on human preferences without requiring a separate reward model or reinforcement learning. It reformulated RLHF as a classification problem over preference pairs, making training more stable and efficient. DPO achieved comparable or better results than RLHF while being simpler to implement and tune, becoming a popular alternative for aligning language models with human preferences.
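The objective can be written out for a single preference pair. This is a scalar sketch; real training averages the loss over a batch and backpropagates through the policy log-probabilities.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares the
    policy-vs-reference log-ratios of the chosen and rejected responses.

    Each argument is a sequence log-probability under the trainable
    policy (pi_*) or the frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one's, which is exactly the implicit reward the paper derives.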

2023
2023
QLoRA: Efficient Fine-Tuning of Quantized LLMs at University of Washington

QLoRA: Efficient Fine-Tuning of Quantized LLMs

University of Washington

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Impact 56 · 3,100 citations

Paper: QLoRA: Efficient Finetuning of Quantized LLMs

GitHub repo: qlora

QLoRA combined 4-bit quantization with LoRA to enable fine-tuning of extremely large models on commodity hardware. It quantized the base model to 4-bit precision while keeping the LoRA adapters in higher precision, matching full 16-bit fine-tuning performance. QLoRA made it possible to fine-tune a 65B-parameter model on a single 48GB GPU, dramatically lowering the hardware barrier and enabling researchers with limited resources to customize state-of-the-art models.
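The compression idea can be illustrated with a toy blockwise absmax quantizer to 4-bit signed levels. This linear sketch is only illustrative: QLoRA's actual NF4 data type uses non-uniform levels matched to normally distributed weights, plus double quantization of the scales.

```python
def quantize_block(weights):
    """Map floats to integers in [-7, 7] plus one per-block scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid /0 on all-zero blocks
    return [round(w / scale) for w in weights], scale

def dequantize_block(q, scale):
    """Recover approximate floats from the 4-bit codes and the scale."""
    return [v * scale for v in q]

q, s = quantize_block([0.1, -0.52, 0.33, 0.7])
restored = dequantize_block(q, s)
```

Only the integer codes and one scale per block are stored, which is where the roughly 4x memory saving over 16-bit weights comes from; the LoRA adapters stay in higher precision so gradients remain well-behaved.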