History of Machine Learning & LLMs
Disclaimer: I used various LLMs to generate some of the data for this timeline.
Impact is a citation-driven editorial score: it ranks each entry by relative citation count, adds a small recency bonus, and the default cutoff hides the most provisional entries.
Perceptron
Cornell Aeronautical Laboratory
Impact 47 · 2,525 citations
Paper: The Perceptron: A Perceiving and Recognizing Automaton
The Perceptron was the first model that could learn the weights defining categories given examples from each category. It established the foundation for artificial neural networks by introducing a learning algorithm that could automatically adjust connection weights. The perceptron demonstrated that machines could learn from experience, marking a fundamental breakthrough in machine learning.
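As a rough sketch of the perceptron learning rule described above (illustrative code, not Rosenblatt's original formulation): on each misclassified example the weights are nudged toward or away from that example, which is enough to learn any linearly separable function such as AND.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([0, 0, 0, 1])                                   # AND labels

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(20):                       # a few passes over the data
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)        # threshold activation
        w += lr * (yi - pred) * xi        # update weights only on errors
        b += lr * (yi - pred)

preds = [int(w @ xi + b > 0) for xi in X]  # perfectly classifies AND
```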
Neocognitron
NHK Science & Technical Research Laboratories
Impact 60 · 6,200 citations
The Neocognitron was a hierarchical, multilayered neural network inspired by the visual cortex. It introduced the concepts of S-cells (simple cells) and C-cells (complex cells) arranged in a hierarchy, allowing for position-invariant pattern recognition. This architecture laid the groundwork for modern convolutional neural networks and demonstrated that local feature extraction combined with spatial pooling could achieve robust visual recognition.
Backpropagation
University of California San Diego, Carnegie Mellon University, University of Toronto
Impact 81 · 29,607 citations
Paper: Learning Representations by Back-propagating Errors
Backpropagation provided an efficient method for training multi-layer neural networks by computing gradients through the chain rule. This algorithm enabled the training of deep networks by propagating error signals backwards through layers, allowing hidden units to learn internal representations. Backpropagation became the workhorse of neural network training and remains fundamental to modern deep learning.
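A minimal sketch of the idea, under illustrative choices (one hidden layer, sigmoid units, mean squared error, plain gradient descent): the error signal is pushed backwards through each layer via the chain rule, giving every weight its gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])    # XOR, unlearnable without a hidden layer

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 1.0
losses = []
for _ in range(2000):
    h = sigmoid(X @ W1 + b1)              # forward pass
    p = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((p - y) ** 2)))
    dp = 2 * (p - y) / len(X)             # dLoss/dp
    dz2 = dp * p * (1 - p)                # through the output sigmoid
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T                       # propagate error to the hidden layer
    dz1 = dh * h * (1 - h)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad                # gradient descent step
```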

Long Short-Term Memory (LSTM)
Technische Universität München
Impact 94 · 139,920 citations
Paper: Long Short-Term Memory
LSTM addressed the vanishing gradient problem in recurrent neural networks by introducing memory cells with gating mechanisms. The architecture uses input, output, and forget gates to control information flow, enabling networks to learn long-term dependencies. LSTM became the dominant architecture for sequence modeling tasks including speech recognition, machine translation, and time series prediction before the transformer era.
Convolutional Neural Networks (LeNet)
AT&T Bell Laboratories
Impact 91 · 72,400 citations
Paper: Gradient-Based Learning Applied to Document Recognition
LeNet introduced a practical convolutional neural network architecture for document recognition. It combined convolutional layers for local feature extraction, pooling for spatial invariance, and fully connected layers for classification. This architecture demonstrated that CNNs could be trained end-to-end using backpropagation and achieved state-of-the-art results on handwritten digit recognition, establishing the blueprint for modern computer vision systems.

The Neural Probabilistic Language Model
Université de Montréal
Impact 76 · 18,500 citations
Paper: A Neural Probabilistic Language Model
The Neural Probabilistic Language Model addressed the curse of dimensionality in language modeling by learning distributed representations for words. It introduced the idea that similar words would have similar vector representations, allowing the model to generalize to unseen word sequences. This foundational work pioneered the use of neural networks for language modeling and word embeddings, directly inspiring Word2Vec and modern language models.

ImageNet Dataset
Princeton University
Impact 90 · 58,300 citations
Paper: ImageNet: A Large-Scale Hierarchical Image Database
ImageNet created a large-scale dataset with over 14 million labeled images across thousands of categories, organized hierarchically using WordNet. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the premier benchmark for computer vision. ImageNet's scale and diversity enabled the training of deep neural networks and catalyzed the deep learning revolution, particularly with AlexNet's breakthrough in 2012.

Xavier/Glorot Initialization
Université de Montréal
Impact 79 · 25,800 citations
Paper: Understanding the Difficulty of Training Deep Feedforward Neural Networks
Xavier initialization provided a principled method for initializing neural network weights to maintain consistent variance of activations and gradients across layers. By scaling initial weights based on the number of input and output connections, it prevented vanishing or exploding gradients during training. This simple but crucial technique enabled the training of much deeper networks and remains a standard practice in deep learning.
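A sketch of the Glorot uniform variant: weights are drawn from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), so the variance of activations stays roughly constant as signals pass through many layers (function names here are illustrative).

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (1000, 256))         # unit-variance input batch
for _ in range(10):                        # ten stacked linear layers
    W = xavier_uniform(256, 256, rng)
    x = x @ W

var = float(x.var())                       # stays near 1.0 instead of
                                           # vanishing or exploding
```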
Rectified Linear Unit (ReLU) Activation
University of Toronto
Impact 63 · 6,800 citations
Paper: Rectified Linear Units Improve Restricted Boltzmann Machines
ReLU introduced a simple non-saturating activation function f(x) = max(0, x) that addressed the vanishing gradient problem of sigmoid and tanh activations. ReLU enabled faster training, reduced computational cost, and induced sparsity in neural networks. Despite its simplicity, ReLU became the default activation function for deep neural networks and enabled the training of much deeper architectures.
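The function and its gradient fit in a few lines, which is much of the point: the gradient is exactly 1 for active units, so it never saturates the way sigmoid and tanh do.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # f(x) = max(0, x)

def relu_grad(x):
    return (x > 0).astype(float)          # 1 where active, 0 where inactive

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(x)                             # negatives are clipped to zero
```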
AlexNet
University of Toronto
Impact 96 · 150,801 citations
Paper: ImageNet Classification with Deep Convolutional Neural Networks
GitHub repo: cuda-convnet2
AlexNet won the ImageNet 2012 competition by a significant margin, demonstrating that deep convolutional networks trained on GPUs could dramatically outperform traditional computer vision methods. The architecture combined ReLU activations, dropout regularization, data augmentation, and GPU training. AlexNet's success marked the beginning of the deep learning era and sparked intense interest in neural networks across academia and industry.
Dropout
University of Toronto
Impact 89 · 56,200 citations
Paper: Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Dropout introduced a powerful regularization technique by randomly dropping units during training, preventing co-adaptation of features. This simple method significantly reduced overfitting in deep neural networks by training an ensemble of exponentially many sub-networks. Dropout became a standard regularization technique and enabled the training of larger networks without excessive overfitting.
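A sketch of the common "inverted dropout" formulation: units are zeroed with probability p at training time and survivors are scaled by 1/(1-p), so inference needs no rescaling at all.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    if not training:
        return x                          # identity at inference time
    mask = rng.random(x.shape) >= p       # keep each unit with prob 1-p
    return x * mask / (1.0 - p)           # rescale so E[output] == input

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, p=0.5, rng=rng)
mean = float(y.mean())                    # close to 1.0 in expectation
```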

Word2Vec
Impact 87 · 45,200 citations
Paper: Efficient Estimation of Word Representations in Vector Space
GitHub repo: word2vec
Word2Vec introduced efficient methods (Skip-gram and CBOW) for learning dense vector representations of words from large corpora. These embeddings captured semantic and syntactic relationships, enabling vector arithmetic like 'king' - 'man' + 'woman' ≈ 'queen'. Word2Vec revolutionized natural language processing by providing a scalable way to represent words as continuous vectors, becoming foundational for modern NLP.
Variational Autoencoder (VAE)
University of Amsterdam
Impact 83 · 32,100 citations
Paper: Auto-Encoding Variational Bayes
VAE introduced a probabilistic approach to learning latent representations by combining variational inference with neural networks. It learns a distribution over latent codes rather than deterministic encodings, enabling both efficient inference and generation. VAE provided a principled framework for generative modeling and became influential in unsupervised learning, representation learning, and generative AI.

Generative Adversarial Network (GAN)
Université de Montréal
Impact 93 · 88,791 citations
Paper: Generative Adversarial Networks
GAN introduced a game-theoretic framework where a generator network learns to create realistic data by competing against a discriminator network. This adversarial training process enabled the generation of highly realistic images without requiring explicit modeling of probability distributions. GANs revolutionized generative modeling and spawned numerous applications in image synthesis, style transfer, and data augmentation.
Adam Optimizer
OpenAI, University of Toronto
Impact 99 · 215,000 citations
Paper: Adam: A Method for Stochastic Optimization
Adam combined the benefits of AdaGrad and RMSProp by computing adaptive learning rates for each parameter using estimates of first and second moments of gradients. It included bias correction terms and proved robust across a wide range of problems with minimal hyperparameter tuning. Adam became the most widely used optimizer in deep learning due to its efficiency, ease of use, and strong empirical performance.
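A sketch of a single Adam update using the paper's default hyperparameters, with the bias-correction terms that compensate for the zero-initialized moment estimates (the helper name is illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = theta^2 starting from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 3001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```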

Sequence-to-Sequence Learning
Impact 80 · 26,800 citations
Paper: Sequence to Sequence Learning with Neural Networks
GitHub repo: seq2seq
Seq2Seq introduced an end-to-end framework for sequence transduction using an encoder-decoder architecture with LSTMs. The encoder maps variable-length input sequences to fixed-size representations, which the decoder transforms into variable-length output sequences. This architecture unified many NLP tasks under a single framework and achieved breakthrough results in machine translation, establishing neural approaches as state-of-the-art.

Attention Mechanism
Université de Montréal
Impact 88 · 47,200 citations
Paper: Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau attention addressed the bottleneck in sequence-to-sequence models by allowing the decoder to focus on different parts of the input sequence at each decoding step. This attention mechanism computed context vectors as weighted sums of encoder hidden states, where weights were learned based on relevance. Attention became a fundamental building block of modern NLP systems and directly inspired the transformer architecture.
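A sketch of one decoding step with additive (Bahdanau-style) attention, with illustrative shapes and parameter names: each encoder state gets a relevance score against the current decoder state, the scores are softmaxed, and the context vector is the resulting weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                               # source length, hidden size
enc = rng.normal(size=(T, d))             # encoder hidden states
s = rng.normal(size=(d,))                 # current decoder state
Wa = rng.normal(size=(d, d)) * 0.1        # alignment-model parameters
Ua = rng.normal(size=(d, d)) * 0.1
va = rng.normal(size=(d,)) * 0.1

scores = np.tanh(s @ Wa + enc @ Ua) @ va  # one score per source position
weights = np.exp(scores) / np.exp(scores).sum()   # softmax alignment weights
context = weights @ enc                   # weighted sum of encoder states
```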

GloVe Word Embeddings
Stanford University
Impact 86 · 40,500 citations
Paper: GloVe: Global Vectors for Word Representation
GitHub repo: GloVe
GloVe combined global matrix factorization with local context window methods for learning word embeddings. It trained on aggregated word-word co-occurrence statistics to produce vectors with meaningful linear substructures. GloVe provided an alternative to Word2Vec with strong performance on word analogy and similarity tasks, and its pre-trained vectors became widely used in NLP applications.
Neural Turing Machine
Google DeepMind
Impact 56 · 3,850 citations
Paper: Neural Turing Machines
Neural Turing Machines extended neural networks by coupling them to external memory resources accessed through attention mechanisms. The entire system was differentiable end-to-end, allowing gradient-based training. NTMs demonstrated that neural networks could learn simple algorithms like copying, sorting, and associative recall from examples alone, showing that neural networks could exhibit more algorithmic and programmable behavior.

Batch Normalization
Impact 92 · 72,400 citations
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Batch Normalization normalized layer inputs across mini-batches, stabilizing training by reducing internal covariate shift. It enabled much higher learning rates, reduced sensitivity to initialization, and acted as a regularizer. Batch normalization dramatically accelerated training and became a standard component in deep networks, enabling the training of very deep architectures that were previously difficult to optimize.
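The training-mode arithmetic is a sketch worth seeing next to layer normalization below in the timeline: statistics are computed per feature across the batch, then a learnable scale (gamma) and shift (beta) are applied.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 16))
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# each feature column now has mean ~0 and variance ~1
```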

Residual Networks (ResNet)
Microsoft Research
Impact 100 · 298,566 citations
Paper: Deep Residual Learning for Image Recognition
GitHub repo: deep-residual-networks
ResNet introduced skip connections that allowed gradients to flow directly through networks by learning residual mappings. This simple architectural change enabled the training of networks with hundreds or even thousands of layers without degradation problems. ResNet won ImageNet 2015 and demonstrated that very deep networks could be effectively trained, fundamentally changing how we design neural network architectures.
Layer Normalization
University of Toronto
Impact 77 · 20,100 citations
Paper: Layer Normalization
Layer Normalization normalized inputs across features for each example independently, unlike batch normalization which normalized across the batch dimension. This made it particularly effective for recurrent neural networks and sequences of varying length. Layer normalization stabilized hidden state dynamics in RNNs and later became the standard normalization technique in transformer architectures.
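A sketch showing the one-line difference from batch normalization: the statistics are taken over the feature axis of each example, so the operation works identically for a batch of one and for variable-length sequences.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # per-example mean over features
    var = x.var(axis=-1, keepdims=True)   # (batch norm uses axis=0 instead)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 16))
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# each row now has mean ~0 and variance ~1, regardless of batch size
```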

Subword Units (BPE): Solving the Rare Word Problem
University of Edinburgh
Impact 71 · 14,800 citations
Paper: Neural Machine Translation of Rare Words with Subword Units
GitHub repo: subword-nmt
Byte-Pair Encoding (BPE) adapted a data compression algorithm for neural machine translation, enabling open-vocabulary learning by breaking words into subword units. This solved the rare word problem by representing infrequent words as sequences of common subwords. BPE became the standard tokenization approach for language models, enabling models to handle any word while maintaining reasonable vocabulary sizes, and is used in GPT, BERT, and most modern LLMs.
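The core merge loop is a toy sketch away (real tokenizers add word-boundary markers and byte-level fallbacks, omitted here): count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    corpus = [list(w) for w in words]     # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1        # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)              # learn the new merged symbol
        for w in corpus:                  # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "low", "low"], num_merges=3)
# frequent substrings like "low" become single vocabulary entries,
# while rare words remain decomposable into known subwords
```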

Transformer Architecture
Impact 98 · 209,982 citations
Paper: Attention Is All You Need
GitHub repo: tensor2tensor
The Transformer replaced recurrence and convolutions entirely with self-attention mechanisms, processing sequences in parallel rather than sequentially. It introduced multi-head attention, positional encodings, and position-wise feed-forward layers within an encoder-decoder structure. The Transformer achieved state-of-the-art translation results while being more parallelizable and requiring significantly less training time. This architecture became the foundation for modern large language models and revolutionized NLP.
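A single-head sketch of the paper's scaled dot-product self-attention, softmax(QK^T / sqrt(d_k))V, computed for all positions at once (shapes and the 0.1 weight scale are illustrative):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv      # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # all-pairs similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row softmax
    return weights @ V                    # every position attends everywhere

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))              # 6 tokens, model dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)       # one parallel pass, no recurrence
```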
Reinforcement Learning from Human Feedback (RLHF)
OpenAI, UC Berkeley, DeepMind
Impact 55 · 3,250 citations
Paper: Deep Reinforcement Learning from Human Preferences
RLHF introduced a method for training RL agents using human preference comparisons rather than hand-crafted reward functions. Humans compared pairs of trajectory segments, and a reward model was trained to predict preferences. This reward model then guided policy optimization. RLHF scaled preference-based learning to complex tasks and later became crucial for aligning large language models with human values and intentions.

Sparsely-Gated Mixture of Experts
Impact 50 · 3,050 citations
Paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Mixture of Experts introduced conditional computation where a gating network routes each input to a sparse subset of expert sub-networks. This enabled training models with orders of magnitude more parameters without proportional increases in computation. MoE demonstrated that model capacity could be dramatically increased through sparsity, achieving state-of-the-art results in language modeling and translation. This approach later influenced large-scale models like GPT-4.
Proximal Policy Optimization (PPO)
OpenAI
Impact 72 · 16,800 citations
Paper: Proximal Policy Optimization Algorithms
GitHub repo: baselines
PPO introduced a simpler and more stable policy gradient method by clipping the objective function to prevent excessively large policy updates. It combined the benefits of trust region methods with the simplicity of first-order optimization. PPO became the most widely used reinforcement learning algorithm due to its robustness, ease of implementation, and strong empirical performance across diverse tasks.
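The clipped surrogate objective can be sketched directly (function name illustrative): the probability ratio between new and old policies is clipped to [1-eps, 1+eps], and the pessimistic minimum removes any incentive for overly large updates.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()  # objective to maximize

# a ratio of 2.0 with positive advantage earns no credit beyond 1 + eps
obj = ppo_clip_objective(
    logp_new=np.log(np.array([2.0])),
    logp_old=np.log(np.array([1.0])),
    advantages=np.array([1.0]),
)
# obj is 1.2, not 2.0: the update toward this sample is capped
```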
ELMo (Embeddings from Language Models)
Allen Institute for AI, University of Washington
Impact 74 · 17,800 citations
Paper: Deep Contextualized Word Representations
GitHub repo: bilm-tf
ELMo generated context-dependent word representations by using bidirectional LSTMs trained as language models. Unlike static embeddings, ELMo representations varied based on context, capturing polysemy and complex linguistic features. ELMo demonstrated the power of pre-training and fine-tuning, significantly improving performance across diverse NLP tasks. It was a crucial step toward modern contextualized language models and transfer learning in NLP.
GPT (Generative Pre-Training)
OpenAI
Impact 59 · 6,100 citations
Paper: Improving Language Understanding by Generative Pre-Training
GitHub repo: finetune-transformer-lm
GPT introduced a two-stage approach: unsupervised pre-training of a transformer language model on large text corpora, followed by supervised fine-tuning on specific tasks. This demonstrated that language models could learn general representations useful across many tasks. GPT showed that pre-training could significantly reduce the labeled data required for downstream tasks, establishing the pre-train-then-fine-tune paradigm that dominated subsequent NLP research.

BERT (Bidirectional Encoder Representations from Transformers)
Impact 97 · 152,370 citations
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GitHub repo: bert
BERT pre-trained bidirectional transformers using masked language modeling and next sentence prediction. Unlike previous unidirectional models, BERT jointly conditioned on both left and right context in all layers. BERT achieved state-of-the-art results across eleven NLP tasks and demonstrated that deeply bidirectional pre-training was crucial for language understanding. BERT became the foundation for numerous downstream applications and variants.

Mixed Precision Training
NVIDIA
Impact 57 · 4,700 citations
Paper: Mixed Precision Training
GitHub repo: apex
Micikevicius et al. showed how to safely train deep networks using half-precision (FP16) arithmetic while preserving full-precision accuracy. By keeping FP32 master weights, accumulating gradients in FP32, and using loss scaling to avoid underflow, they demonstrated 2–3× speedups on NVIDIA Tensor Cores without sacrificing convergence. Mixed precision became the standard recipe for large-scale transformer training, enabling today's models to fit within GPU memory budgets.
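The underflow problem and the loss-scaling fix can be demonstrated in a few lines (the scale of 65536 is just an illustrative power of two): a tiny gradient rounds to zero in FP16, but survives if scaled up before the cast and divided back out in higher precision.

```python
import numpy as np

true_grad = 1e-8                          # below FP16's subnormal range
naive = np.float16(true_grad)             # underflows to exactly 0.0

scale = 65536.0                           # loss scale (a power of two)
scaled = np.float16(true_grad * scale)    # ~6.55e-4: representable in FP16
recovered = np.float32(scaled) / scale    # unscale in full precision
# recovered is within a fraction of a percent of the true gradient
```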
GPT-2
OpenAI
Impact 69 · 12,400 citations
Paper: Language Models are Unsupervised Multitask Learners
GitHub repo: gpt-2
GPT-2 scaled up the original GPT to 1.5 billion parameters and trained on a larger, more diverse dataset. It demonstrated that language models could perform many tasks zero-shot without fine-tuning by simply conditioning on appropriate prompts. GPT-2 showed strong performance on diverse tasks including translation, summarization, and question answering, suggesting that with sufficient scale and data, language models naturally learn multitask capabilities.

T5 (Text-to-Text Transfer Transformer)
Impact 78 · 21,500 citations
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
GitHub repo: text-to-text-transfer-transformer
T5 unified all NLP tasks into a text-to-text format where both inputs and outputs are text strings. It systematically explored transfer learning techniques including pre-training objectives, architectures, datasets, and fine-tuning methods. T5's encoder-decoder architecture and comprehensive evaluation provided insights into what makes transfer learning effective. The unified framework simplified multi-task learning and became influential for instruction-following models.
Scaling Laws for Neural Language Models
OpenAI
Impact 58 · 5,400 citations
Paper: Scaling Laws for Neural Language Models
This work empirically demonstrated that language model performance scales as power-laws with model size, dataset size, and compute budget. The research showed predictable relationships between these factors and suggested optimal allocation strategies. These scaling laws provided quantitative guidance for training large models and predicted that simply scaling up models would continue to yield improvements, influencing subsequent investment in large-scale model development.
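The power-law form can be sketched with the paper's approximate fitted constants for model size (N_c ≈ 8.8e13, alpha ≈ 0.076; treat these as illustrative): doubling parameters multiplies loss by the same factor anywhere on the curve, which is what makes extrapolation possible.

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha=0.076):
    # L(N) = (N_c / N) ** alpha, ignoring the data-limited regime
    return (n_c / n_params) ** alpha

# the benefit of doubling model size is scale-invariant:
r_small = loss_vs_params(2e8) / loss_vs_params(1e8)    # 100M -> 200M params
r_large = loss_vs_params(2e10) / loss_vs_params(1e10)  # 10B -> 20B params
# both ratios equal 2 ** -0.076, roughly a 5% loss reduction per doubling
```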
GPT-3
OpenAI
Impact 82 · 30,200 citations
Paper: Language Models are Few-Shot Learners
GPT-3 scaled transformers to 175 billion parameters, demonstrating that language models could perform diverse tasks with few-shot, one-shot, or zero-shot learning from prompts alone. It showed impressive performance on translation, question-answering, arithmetic, and novel word usage without gradient updates. GPT-3 revealed that with sufficient scale, language models develop broad capabilities and sparked widespread interest in large language models and prompt engineering.
Retrieval-Augmented Generation (RAG)
Facebook AI
Impact 65 · 7,900 citations
Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG combined parametric language models with non-parametric document retrieval, allowing generation to be grounded in an external knowledge index rather than only in model weights. This substantially improved factual QA and made it practical to update a system's knowledge without full retraining. The paper became the conceptual template for modern retrieval-augmented LLM systems.

ZeRO (Zero Redundancy Optimizer)
Microsoft
Impact 54 · 3,200 citations
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
GitHub repo: DeepSpeed
ZeRO eliminated memory redundancies in data-parallel distributed training by partitioning optimizer states, gradients, and parameters across devices rather than replicating them. ZeRO enabled training models with trillions of parameters by dramatically reducing per-device memory requirements while maintaining computational efficiency. This optimization became crucial for training large language models and is implemented in DeepSpeed, enabling the scale of models like GPT-3 and beyond.

RoFormer: Rotary Position Embedding (RoPE)
Zhuiyi Technology
Impact 50 · 2,800 citations
Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
GitHub repo: roformer
Rotary Position Embedding (RoPE) encodes position information by rotating word embeddings based on their absolute positions, while naturally encoding relative position information through the rotation properties. RoPE provided better extrapolation to longer sequences than previous position encoding methods while being computationally efficient. It was adopted by influential models including PaLM, LLaMA, and many other modern LLMs, becoming a preferred position encoding technique.
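A sketch of the mechanism (pair layout simplified for clarity): each 2-D slice of a query or key vector is rotated by an angle proportional to its absolute position, which makes the dot product between a rotated query and key depend only on their relative offset.

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)    # one frequency per 2-D pair
    angles = pos * freqs                          # rotation grows with position
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))
# the query-key dot product depends only on the relative offset m - n:
d1 = rope(q, 5) @ rope(k, 3)          # positions 5 and 3, offset 2
d2 = rope(q, 105) @ rope(k, 103)      # positions 105 and 103, offset 2
```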
CLIP
OpenAI
Impact 86 · 39,800 citations
Paper: Learning Transferable Visual Models From Natural Language Supervision
GitHub repo: CLIP
CLIP trained image and text encoders contrastively on large-scale web image-caption pairs, showing that natural language supervision could produce zero-shot transferable visual representations. It was a major step toward modern multimodal foundation models and strongly influenced vision-language pretraining, retrieval, and image generation workflows. CLIP also helped popularize prompt-based evaluation outside pure NLP.

LoRA: Low-Rank Adaptation of Large Language Models
Microsoft
Impact 67 · 8,900 citations
Paper: LoRA: Low-Rank Adaptation of Large Language Models
GitHub repo: LoRA
LoRA enabled efficient fine-tuning of large language models by training low-rank decomposition matrices that are added to frozen pre-trained weights. This reduced trainable parameters by 10,000x and memory requirements by 3x while maintaining or exceeding full fine-tuning performance. LoRA made it practical to customize large models for specific tasks with limited compute resources, democratizing access to fine-tuning and enabling rapid adaptation of foundation models.
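A sketch of the forward pass with illustrative dimensions: the frozen weight W0 is augmented by a trainable rank-r update scaled by alpha/r, with B zero-initialized so training starts from the unchanged pretrained model.

```python
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01        # trainable, rank r
B = np.zeros((d, r))                      # trainable, zero-init: the update
                                          # starts as an exact no-op

def lora_forward(x):
    # effective weight is W0 + (alpha / r) * B @ A, never materialized
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(3, k))
out = lora_forward(x)                     # identical to the base model at init

full_params = d * k                       # 262,144 in the frozen matrix
lora_params = r * (d + k)                 # 8,192 trainable: 32x fewer here
```

The parameter savings grow with matrix size, which is how the paper reaches its much larger reductions on GPT-3-scale weights.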
Chinchilla Scaling Laws
DeepMind
Impact 70 · 9,900 citations
Paper: Training Compute-Optimal Large Language Models
Chinchilla revised the original scaling-laws story by showing that many frontier models were undertrained relative to their parameter count and that, for a fixed compute budget, smaller models trained on substantially more tokens perform better. This changed how practitioners think about optimal model/data allocation and heavily influenced subsequent LLM training recipes. It became one of the clearest examples of scaling-law results directly changing engineering strategy.
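The takeaway is often summarized as a rule of thumb of roughly 20 training tokens per parameter; combined with the common compute approximation C ≈ 6·N·D, the compute-optimal allocation for a budget follows in two lines (treat both constants as rough, not the paper's full fitted laws):

```python
def compute_optimal(c_flops, tokens_per_param=20.0):
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / 120)
    n = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# for a Chinchilla-scale budget of ~5.76e23 FLOPs:
n, d = compute_optimal(5.76e23)
# roughly 70B parameters and 1.4T tokens, matching Chinchilla itself
```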
FLAN Instruction Tuning
Google Research
Impact 69 · 9,100 citations
Paper: Finetuned Language Models Are Zero-Shot Learners
GitHub repo: FLAN
FLAN showed that instruction tuning across a broad mixture of tasks can dramatically improve zero-shot and few-shot generalization, making pretrained models much better at following natural-language instructions without task-specific finetuning. It helped establish instruction tuning as a core post-training step for useful LLM assistants. Much of the later chat-assistant paradigm builds on this lesson.
InstructGPT
OpenAI
Impact 66 · 7,200 citations
Paper: Training Language Models to Follow Instructions with Human Feedback
InstructGPT fine-tuned GPT-3 using supervised learning on human-written demonstrations followed by reinforcement learning from human feedback. Outputs from InstructGPT models with over 100x fewer parameters were preferred to GPT-3's outputs. The model showed improvements in truthfulness, helpfulness, and reduced toxicity. InstructGPT demonstrated that alignment with human preferences through RLHF was crucial for making language models useful and safe, establishing the approach used in ChatGPT.

Chain-of-Thought Prompting
Google Research
Impact 63 · 6,700 citations
Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting enabled language models to solve complex reasoning tasks by generating intermediate reasoning steps before arriving at final answers. Simply adding a few examples with reasoning chains dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. This technique revealed emergent reasoning capabilities in large models and demonstrated that prompting strategies could unlock latent abilities without additional training.

FlashAttention
Stanford University
Impact 54 · 3,100 citations
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
GitHub repo: flash-attention
FlashAttention optimized the attention mechanism by accounting for GPU memory hierarchy, using tiling to reduce data movement between GPU memory levels. This IO-aware algorithm achieved exact attention with significantly reduced memory usage and 2-4x speedup compared to standard implementations. FlashAttention enabled training transformers with much longer context lengths and became widely adopted, fundamentally improving the efficiency of transformer models.
Constitutional AI: Harmlessness from AI Feedback
Anthropic
Impact 47 · 1,850 citations
Paper: Constitutional AI: Harmlessness from AI Feedback
Constitutional AI introduced a method for training harmless AI assistants using AI-generated feedback based on a set of principles (a 'constitution') rather than relying solely on human feedback. The model critiques and revises its own responses according to constitutional principles, then learns from these self-improvements. This approach reduced reliance on human labelers for harmlessness training while making the values guiding AI behavior more transparent and debuggable.
LLaMA
Meta AI
Impact 78 · 18,300 citations
Paper: LLaMA: Open and Efficient Foundation Language Models
LLaMA showed that carefully trained smaller foundation models could compete strongly with much larger systems, and its release catalyzed the open-weight LLM ecosystem. It accelerated research on fine-tuning, alignment, evaluation, and local deployment by giving the community a strong accessible base model family. In practice, it was a major inflection point for open LLM development.

Direct Preference Optimization (DPO)
Stanford University
Impact 51 · 2,600 citations
Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
GitHub repo: direct-preference-optimization
DPO simplified preference learning by directly optimizing language models on human preferences without requiring a separate reward model or reinforcement learning. It reformulated RLHF as a classification problem over preference pairs, making training more stable and efficient. DPO achieved comparable or better results than RLHF while being simpler to implement and tune, becoming a popular alternative for aligning language models with human preferences.
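The per-pair loss can be sketched directly (function name illustrative): each response's implicit reward is beta times the log-ratio of policy to reference probabilities, and the loss is a logistic loss on the chosen-minus-rejected reward margin.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    r_chosen = beta * (logp_chosen - ref_logp_chosen)     # implicit reward
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))         # -log sigmoid(margin)

# loss falls as the policy shifts mass toward the chosen response
base = dpo_loss(-10.0, -10.0, -10.0, -10.0)    # policy == reference: -log 0.5
better = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # chosen up, rejected down
```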

QLoRA: Efficient Fine-Tuning of Quantized LLMs
University of Washington
Impact 56 · 3,100 citations
Paper: QLoRA: Efficient Finetuning of Quantized LLMs
GitHub repo: qlora
QLoRA combined quantization with LoRA to enable fine-tuning of extremely large models on consumer hardware. It quantized the base model to 4-bit precision while using LoRA adapters in higher precision, maintaining full fine-tuning performance. QLoRA made it possible to fine-tune a 65B parameter model on a single GPU with 48GB memory, dramatically democratizing access to fine-tuning large language models and enabling researchers with limited resources to customize state-of-the-art models.