History of Machine Learning & LLMs
Disclaimer: I used various LLMs to generate some of the data for this timeline.
Perceptron
Cornell Aeronautical Laboratory
Paper: The Perceptron: A Perceiving and Recognizing Automaton
The Perceptron was the first model that could learn, from labeled examples of each category, the weights that define the category boundary. It established the foundation for artificial neural networks by introducing a learning algorithm that could automatically adjust connection weights. The perceptron demonstrated that machines could learn from experience, marking a fundamental breakthrough in machine learning.
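As a rough sketch of the learning rule (a modern numpy rendering, not Rosenblatt's original formulation), the perceptron nudges its weights toward any example it currently misclassifies:

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Train a single-layer perceptron; y must contain labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the current weights misclassify the example.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data: an OR-like function with {-1, +1} labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # expected: [-1  1  1  1]
```
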
Neocognitron
NHK Science & Technical Research Laboratories
The Neocognitron was a hierarchical, multilayered neural network inspired by the visual cortex. It introduced the concepts of S-cells (simple cells) and C-cells (complex cells) arranged in a hierarchy, allowing for position-invariant pattern recognition. This architecture laid the groundwork for modern convolutional neural networks and demonstrated that local feature extraction combined with spatial pooling could achieve robust visual recognition.
Backpropagation
University of California San Diego, Carnegie Mellon University, University of Toronto
Paper: Learning Representations by Back-propagating Errors
Backpropagation provided an efficient method for training multi-layer neural networks by computing gradients through the chain rule. This algorithm enabled the training of deep networks by propagating error signals backwards through layers, allowing hidden units to learn internal representations. Backpropagation became the workhorse of neural network training and remains fundamental to modern deep learning.
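A minimal numpy sketch of the idea, computing gradients for a tiny two-layer network by hand via the chain rule (illustrative only, not taken from the paper):

```python
import numpy as np

# Tiny 2-layer network trained on a toy regression target, with gradients
# computed manually via the chain rule (the essence of backpropagation).
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))           # 16 examples, 3 features
y = X.sum(axis=1, keepdims=True)       # toy target

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

lr = 0.1
for step in range(200):
    # Forward pass
    h_pre = X @ W1 + b1
    h = np.tanh(h_pre)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: propagate the error signal layer by layer.
    d_yhat = 2 * (y_hat - y) / len(X)          # dL/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T                        # chain rule into the hidden layer
    d_hpre = d_h * (1 - np.tanh(h_pre) ** 2)   # through the tanh nonlinearity
    dW1 = X.T @ d_hpre
    db1 = d_hpre.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final MSE: {loss:.4f}")
```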

Long Short-Term Memory (LSTM)
Technische Universität München
Paper: Long Short-Term Memory
LSTM addressed the vanishing gradient problem in recurrent neural networks by introducing memory cells with gating mechanisms. The architecture uses input, output, and forget gates to control information flow, enabling networks to learn long-term dependencies. LSTM became the dominant architecture for sequence modeling tasks including speech recognition, machine translation, and time series prediction before the transformer era.
Convolutional Neural Networks (LeNet)
AT&T Bell Laboratories
Paper: Gradient-Based Learning Applied to Document Recognition
LeNet introduced a practical convolutional neural network architecture for document recognition. It combined convolutional layers for local feature extraction, pooling for spatial invariance, and fully connected layers for classification. This architecture demonstrated that CNNs could be trained end-to-end using backpropagation and achieved state-of-the-art results on handwritten digit recognition, establishing the blueprint for modern computer vision systems.
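A LeNet-5-style definition in PyTorch, sketched as an approximation of the published architecture (layer sizes follow the classic 32x32 digit setup; activations and pooling are simplified relative to the 1998 paper):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A LeNet-5-style CNN: two conv+pool stages for local feature extraction
    followed by fully connected layers for classifying 32x32 grayscale digits."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```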

The Neural Probabilistic Language Model
Université de Montréal
Paper: A Neural Probabilistic Language Model
The Neural Probabilistic Language Model addressed the curse of dimensionality in language modeling by learning distributed representations for words. It introduced the idea that similar words would have similar vector representations, allowing the model to generalize to unseen word sequences. This foundational work pioneered the use of neural networks for language modeling and word embeddings, directly inspiring Word2Vec and modern language models.

ImageNet Dataset
Princeton University
Paper: ImageNet: A Large-Scale Hierarchical Image Database
ImageNet created a large-scale dataset with over 14 million labeled images across thousands of categories, organized hierarchically using WordNet. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the premier benchmark for computer vision. ImageNet's scale and diversity enabled the training of deep neural networks and catalyzed the deep learning revolution, particularly with AlexNet's breakthrough in 2012.

Xavier/Glorot Initialization
Université de Montréal
Paper: Understanding the Difficulty of Training Deep Feedforward Neural Networks
Xavier initialization provided a principled method for initializing neural network weights to maintain consistent variance of activations and gradients across layers. By scaling initial weights based on the number of input and output connections, it prevented vanishing or exploding gradients during training. This simple but crucial technique enabled the training of much deeper networks and remains a standard practice in deep learning.
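A minimal sketch of the uniform variant of the initializer; the scaling factor sqrt(6 / (fan_in + fan_out)) keeps the variance of activations and gradients roughly constant across layers:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    """Glorot/Xavier uniform initialization: scale the weight range by the
    number of input and output connections."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.std())  # ≈ sqrt(2 / (fan_in + fan_out)) ≈ 0.072
```
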
Rectified Linear Unit (ReLU) Activation
University of Toronto
Paper: Rectified Linear Units Improve Restricted Boltzmann Machines
ReLU introduced a simple non-saturating activation function f(x) = max(0, x) that addressed the vanishing gradient problem of sigmoid and tanh activations. ReLU enabled faster training, reduced computational cost, and induced sparsity in neural networks. Despite its simplicity, ReLU became the default activation function for deep neural networks and enabled the training of much deeper architectures.
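The function itself is a one-liner; a small numpy sketch:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x): identity for positive inputs, zero otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]
```
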
AlexNet
University of Toronto
Paper: ImageNet Classification with Deep Convolutional Neural Networks
GitHub repo: cuda-convnet2
AlexNet won the ImageNet 2012 competition by a significant margin, demonstrating that deep convolutional networks trained with GPUs could dramatically outperform traditional computer vision methods. The architecture combined ReLU activations, dropout regularization, data augmentation, and GPU training. AlexNet's success marked the beginning of the deep learning era and sparked intense interest in neural networks across academia and industry.
Dropout
University of Toronto
Paper: Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Dropout introduced a powerful regularization technique by randomly dropping units during training, preventing co-adaptation of features. This simple method significantly reduced overfitting in deep neural networks by training an ensemble of exponentially many sub-networks. Dropout became a standard regularization technique and enabled the training of larger networks without excessive overfitting.
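A minimal sketch using the now-common "inverted" formulation, which rescales surviving activations at training time (the original paper instead scales weights at test time):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng()):
    """Zero each unit with probability p during training and rescale the
    survivors so the expected activation is unchanged; at inference time
    the layer is the identity."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = np.ones((2, 8))
print(dropout(h, p=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0
```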

Word2Vec
Paper: Efficient Estimation of Word Representations in Vector Space
GitHub repo: word2vec
Word2Vec introduced efficient methods (Skip-gram and CBOW) for learning dense vector representations of words from large corpora. These embeddings captured semantic and syntactic relationships, enabling vector arithmetic like 'king' - 'man' + 'woman' ≈ 'queen'. Word2Vec revolutionized natural language processing by providing a scalable way to represent words as continuous vectors, becoming foundational for modern NLP.
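A toy sketch of the analogy arithmetic with hand-made vectors (real embeddings are learned with Skip-gram or CBOW; the vectors and the `analogy` helper here are purely illustrative):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    """Solve a : b :: c : ?  by finding the word whose vector is closest
    to vec(b) - vec(a) + vec(c) in cosine similarity."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made vectors purely for illustration.
vectors = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([0.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.1, 1.0]),
    "queen": np.array([0.0, 1.1, 1.0]),
    "apple": np.array([0.3, 0.3, -1.0]),
}
print(analogy(vectors, "man", "king", "woman"))  # 'queen'
```
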
Variational Autoencoder (VAE)
University of Amsterdam
Paper: Auto-Encoding Variational Bayes
VAE introduced a probabilistic approach to learning latent representations by combining variational inference with neural networks. It learns a distribution over latent codes rather than deterministic encodings, enabling both efficient inference and generation. VAE provided a principled framework for generative modeling and became influential in unsupervised learning, representation learning, and generative AI.

Generative Adversarial Network (GAN)
Université de Montréal
Paper: Generative Adversarial Networks
GAN introduced a game-theoretic framework where a generator network learns to create realistic data by competing against a discriminator network. This adversarial training process enabled the generation of highly realistic images without requiring explicit modeling of probability distributions. GANs revolutionized generative modeling and spawned numerous applications in image synthesis, style transfer, and data augmentation.
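A sketch of the two training objectives in their binary cross-entropy form, including the non-saturating generator loss suggested in the paper (an illustrative PyTorch helper, not the original implementation):

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real_logits, d_fake_logits):
    """Minimax game in BCE form: the discriminator is pushed to output 1 on
    real data and 0 on generated data, while the generator is trained with
    the non-saturating loss -- maximize the discriminator's belief that its
    samples are real."""
    ones, zeros = torch.ones_like(d_real_logits), torch.zeros_like(d_fake_logits)
    d_loss = (F.binary_cross_entropy_with_logits(d_real_logits, ones) +
              F.binary_cross_entropy_with_logits(d_fake_logits, zeros))
    g_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    return d_loss, g_loss

# Toy discriminator outputs for a batch of 4 real and 4 generated samples.
d_loss, g_loss = gan_losses(torch.randn(4), torch.randn(4))
print(d_loss.item(), g_loss.item())
```
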
Adam Optimizer
OpenAI, University of Toronto
Paper: Adam: A Method for Stochastic Optimization
Adam combined the benefits of AdaGrad and RMSProp by computing adaptive learning rates for each parameter using estimates of first and second moments of gradients. It included bias correction terms and proved robust across a wide range of problems with minimal hyperparameter tuning. Adam became the most widely used optimizer in deep learning due to its efficiency, ease of use, and strong empirical performance.
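A minimal sketch of a single Adam update plus a toy 1-D optimization run (hyperparameter defaults follow the paper):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    its square (v), bias-corrected, then a per-parameter adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = (x - 3)^2 with Adam.
x, m, v = np.array(0.0), 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
print(x)  # ≈ 3.0
```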

Sequence-to-Sequence Learning
Paper: Sequence to Sequence Learning with Neural Networks
GitHub repo: seq2seq
Seq2Seq introduced an end-to-end framework for sequence transduction using an encoder-decoder architecture with LSTMs. The encoder maps variable-length input sequences to fixed-size representations, which the decoder transforms into variable-length output sequences. This architecture unified many NLP tasks under a single framework and achieved breakthrough results in machine translation, establishing neural approaches as state-of-the-art.

Attention Mechanism
Université de Montréal
Paper: Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau attention addressed the bottleneck in sequence-to-sequence models by allowing the decoder to focus on different parts of the input sequence at each decoding step. This attention mechanism computed context vectors as weighted sums of encoder hidden states, where weights were learned based on relevance. Attention became a fundamental building block of modern NLP systems and directly inspired the transformer architecture.
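A numpy sketch of the additive scoring function and context computation (the matrix names Wq, Wk, and v are illustrative; the paper's notation differs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, Wq, Wk, v):
    """Bahdanau-style (additive) attention: score each encoder state against
    the current decoder state with a small MLP, softmax the scores, and
    return the weighted sum of encoder states as the context vector."""
    scores = np.array([
        v @ np.tanh(Wq @ decoder_state + Wk @ h) for h in encoder_states
    ])
    weights = softmax(scores)              # attention distribution over source positions
    context = weights @ encoder_states     # convex combination of encoder states
    return context, weights

rng = np.random.default_rng(0)
d = 8
encoder_states = rng.normal(size=(5, d))   # 5 source positions
decoder_state = rng.normal(size=d)
Wq, Wk, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
context, weights = additive_attention(decoder_state, encoder_states, Wq, Wk, v)
print(weights.round(3), weights.sum())     # weights sum to 1.0
```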

GloVe Word Embeddings
Stanford University
Paper: GloVe: Global Vectors for Word Representation
GitHub repo: GloVe
GloVe combined global matrix factorization with local context window methods for learning word embeddings. It trained on aggregated word-word co-occurrence statistics to produce vectors with meaningful linear substructures. GloVe provided an alternative to Word2Vec with strong performance on word analogy and similarity tasks, and its pre-trained vectors became widely used in NLP applications.
Neural Turing Machine
Google DeepMind
Paper: Neural Turing Machines
Neural Turing Machines extended neural networks by coupling them to external memory resources accessed through attention mechanisms. The entire system was differentiable end-to-end, allowing gradient-based training. NTMs demonstrated that neural networks could learn simple algorithms like copying, sorting, and associative recall from examples alone, showing that neural networks could exhibit more algorithmic and programmable behavior.

Batch Normalization
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Batch Normalization normalized layer inputs across mini-batches, stabilizing training by reducing internal covariate shift. It enabled much higher learning rates, reduced sensitivity to initialization, and acted as a regularizer. Batch normalization dramatically accelerated training and became a standard component in deep networks, enabling the training of very deep architectures that were previously difficult to optimize.
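A sketch of the training-mode computation; real implementations also keep running statistics for use at inference time:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization: normalize each feature using the
    mean/variance computed across the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)          # statistics over the batch dimension
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1
```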

Residual Networks (ResNet)
Microsoft Research
Paper: Deep Residual Learning for Image Recognition
GitHub repo: deep-residual-networks
ResNet introduced skip connections that allowed gradients to flow directly through networks by learning residual mappings. This simple architectural change enabled the training of networks with hundreds or even thousands of layers without degradation problems. ResNet won ImageNet 2015 and demonstrated that very deep networks could be effectively trained, fundamentally changing how we design neural network architectures.
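A PyTorch sketch of a basic residual block with an identity shortcut (shapes are assumed to match; the paper also uses projection shortcuts and a deeper "bottleneck" variant):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A ResNet-style basic block: the convolutions learn a residual F(x),
    and the skip connection adds the input back, so the block outputs
    F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # skip connection: gradients flow directly through '+ x'

x = torch.randn(2, 64, 32, 32)
print(BasicBlock(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```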

Luong Attention (Global and Local Attention)
Stanford University
Paper: Effective Approaches to Attention-based Neural Machine Translation
Luong attention introduced two complementary attention mechanisms for neural machine translation: global attention, which attends to all source words, and local attention, which focuses on a subset of source positions. The paper also proposed multiplicative (dot-product) attention as a simpler alternative to additive attention. These mechanisms achieved significant improvements over non-attentional systems, with the local attention approach gaining 5.0 BLEU points. The work established a new state-of-the-art on WMT'15 English-German translation and influenced subsequent attention designs in transformers.
Layer Normalization
University of Toronto
Paper: Layer Normalization
Layer Normalization normalized inputs across features for each example independently, unlike batch normalization which normalized across the batch dimension. This made it particularly effective for recurrent neural networks and sequences of varying length. Layer normalization stabilized hidden state dynamics in RNNs and later became the standard normalization technique in transformer architectures.
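A numpy sketch highlighting the difference from batch normalization: statistics are taken over the feature dimension of each example rather than over the batch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: per-example statistics over the feature
    dimension, so it works for any batch size and for variable-length
    sequences."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(2, 5, 16))   # (batch, seq, features)
out = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=-1).round(3))  # ≈ 0 for every token
```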

Subword Units (BPE): Solving the Rare Word Problem
University of Edinburgh
Paper: Neural Machine Translation of Rare Words with Subword Units
GitHub repo: subword-nmt
Byte-Pair Encoding (BPE) adapted a data compression algorithm for neural machine translation, enabling open-vocabulary learning by breaking words into subword units. This solved the rare word problem by representing infrequent words as sequences of common subwords. BPE became the standard tokenization approach for language models, enabling models to handle any word while maintaining reasonable vocabulary sizes, and is used in GPT, BERT, and most modern LLMs.
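A minimal sketch of the merge-learning loop in the spirit of the paper's pseudocode (the toy vocabulary below is pre-split into characters with an end-of-word marker):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Words are pre-split into characters with an end-of-word marker '</w>'.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # greedily merge the most frequent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```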

Key-Value Memory Networks
Facebook AI Research, Carnegie Mellon University
Paper: Key-Value Memory Networks for Directly Reading Documents
Key-Value Memory Networks introduced a memory architecture that separates keys (used for addressing) from values (used for reading), enabling more effective question answering by directly reading documents. This separation allowed the model to use different encodings for matching queries to memory slots versus returning information, significantly improving performance on knowledge base and document-based QA tasks. The architecture influenced subsequent memory-augmented networks and retrieval-augmented generation systems.

Transformer Architecture
Paper: Attention Is All You Need
GitHub repo: tensor2tensor
The Transformer replaced recurrence and convolutions entirely with self-attention mechanisms, processing sequences in parallel rather than sequentially. It introduced multi-head attention, positional encodings, and an encoder-decoder structure built from attention and position-wise feed-forward layers. The Transformer achieved state-of-the-art translation results while being more parallelizable and requiring significantly less training time. This architecture became the foundation for modern large language models and revolutionized NLP.
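A numpy sketch of the scaled dot-product attention at the heart of the architecture; multi-head attention runs this in parallel over several learned projections of Q, K, and V:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V -- applied in
    parallel over all positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq, d_k = 6, 16
Q, K, V = (rng.normal(size=(seq, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 16)
```
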
Reinforcement Learning from Human Feedback (RLHF)
OpenAI, UC Berkeley, DeepMind
Paper: Deep Reinforcement Learning from Human Preferences
RLHF introduced a method for training RL agents using human preference comparisons rather than hand-crafted reward functions. Humans compared pairs of trajectory segments, and a reward model was trained to predict preferences. This reward model then guided policy optimization. RLHF scaled preference-based learning to complex tasks and later became crucial for aligning large language models with human values and intentions.

Sparsely-Gated Mixture of Experts
Paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Mixture of Experts introduced conditional computation where a gating network routes each input to a sparse subset of expert sub-networks. This enabled training models with orders of magnitude more parameters without proportional increases in computation. MoE demonstrated that model capacity could be dramatically increased through sparsity, achieving state-of-the-art results in language modeling and translation. This approach later influenced large-scale models like GPT-4.
Proximal Policy Optimization (PPO)
OpenAI
Paper: Proximal Policy Optimization Algorithms
GitHub repo: baselines
PPO introduced a simpler and more stable policy gradient method by clipping the objective function to prevent excessively large policy updates. It combined the benefits of trust region methods with the simplicity of first-order optimization. PPO became the most widely used reinforcement learning algorithm due to its robustness, ease of implementation, and strong empirical performance across diverse tasks.
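A sketch of the clipped surrogate objective on toy numbers (full PPO also includes a value-function loss and an entropy bonus):

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: the probability ratio is clipped to
    [1 - eps, 1 + eps], and the pessimistic (minimum) term is taken so a
    single update cannot move the policy too far."""
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))   # objective to maximize

# Toy numbers: large ratios get no extra credit once clipped.
log_probs_old = np.log(np.array([0.2, 0.5, 0.1]))
log_probs_new = np.log(np.array([0.4, 0.5, 0.05]))
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(log_probs_new, log_probs_old, advantages))
```
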
ELMo (Embeddings from Language Models)
Allen Institute for AI, University of Washington
Paper: Deep Contextualized Word Representations
GitHub repo: bilm-tf
ELMo generated context-dependent word representations by using bidirectional LSTMs trained as language models. Unlike static embeddings, ELMo representations varied based on context, capturing polysemy and complex linguistic features. ELMo demonstrated the power of pre-training and fine-tuning, significantly improving performance across diverse NLP tasks. It was a crucial step toward modern contextualized language models and transfer learning in NLP.
GPT (Generative Pre-Training)
OpenAI
Paper: Improving Language Understanding by Generative Pre-Training
GitHub repo: finetune-transformer-lm
GPT introduced a two-stage approach: unsupervised pre-training of a transformer language model on large text corpora, followed by supervised fine-tuning on specific tasks. This demonstrated that language models could learn general representations useful across many tasks. GPT showed that pre-training could significantly reduce the labeled data required for downstream tasks, establishing the pre-train-then-fine-tune paradigm that dominated subsequent NLP research.

BERT (Bidirectional Encoder Representations from Transformers)
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GitHub repo: bert
BERT pre-trained bidirectional transformers using masked language modeling and next sentence prediction. Unlike previous unidirectional models, BERT jointly conditioned on both left and right context in all layers. BERT achieved state-of-the-art results across eleven NLP tasks and demonstrated that deeply bidirectional pre-training was crucial for language understanding. BERT became the foundation for numerous downstream applications and variants.

Mixed Precision Training
NVIDIA
Paper: Mixed Precision Training
GitHub repo: apex
Micikevicius et al. showed how to safely train deep networks using half-precision (FP16) arithmetic while preserving full-precision accuracy. By keeping FP32 master weights, accumulating gradients in FP32, and using loss scaling to avoid underflow, they demonstrated 2–3× speedups on NVIDIA Tensor Cores without sacrificing convergence. Mixed precision became the standard recipe for large-scale transformer training, enabling today's models to fit within GPU memory budgets.
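A minimal sketch of the recipe using PyTorch's automatic mixed-precision utilities rather than the paper's original Apex-based setup (it assumes a CUDA device is available):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# FP32 "master" weights live in the model/optimizer; most math runs in FP16.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()   # dynamic loss scaling keeps small FP16 gradients from underflowing

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with autocast():                   # eligible ops run in half precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backprop a scaled loss
    scaler.step(optimizer)             # unscales gradients, skips the step on inf/NaN
    scaler.update()                    # adjusts the scale factor for the next step
```
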
GPT-2
OpenAI
Paper: Language Models are Unsupervised Multitask Learners
GitHub repo: gpt-2
GPT-2 scaled up the original GPT to 1.5 billion parameters and trained on a larger, more diverse dataset. It demonstrated that language models could perform many tasks zero-shot without fine-tuning by simply conditioning on appropriate prompts. GPT-2 showed strong performance on diverse tasks including translation, summarization, and question answering, suggesting that with sufficient scale and data, language models naturally learn multitask capabilities.

T5 (Text-to-Text Transfer Transformer)
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
GitHub repo: text-to-text-transfer-transformer
T5 unified all NLP tasks into a text-to-text format where both inputs and outputs are text strings. It systematically explored transfer learning techniques including pre-training objectives, architectures, datasets, and fine-tuning methods. T5's encoder-decoder architecture and comprehensive evaluation provided insights into what makes transfer learning effective. The unified framework simplified multi-task learning and became influential for instruction-following models.

The Bitter Lesson
University of Alberta, DeepMind
Paper: The Bitter Lesson (Essay)
The Bitter Lesson essay argued that general methods leveraging computation consistently outperform approaches that rely on human knowledge in the long run. Sutton observed that search and learning, when given sufficient computation, surpass hand-crafted features and domain expertise. This philosophical perspective influenced the field to focus on scalable learning methods rather than encoding human knowledge, providing intellectual foundation for the scaling paradigm in modern AI.
Scaling Laws for Neural Language Models
OpenAI
Paper: Scaling Laws for Neural Language Models
This work empirically demonstrated that language model performance scales as power-laws with model size, dataset size, and compute budget. The research showed predictable relationships between these factors and suggested optimal allocation strategies. These scaling laws provided quantitative guidance for training large models and predicted that simply scaling up models would continue to yield improvements, influencing subsequent investment in large-scale model development.
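A sketch of the power-law functional form for loss versus parameter count; the constants below are illustrative placeholders, not authoritative values from the paper:

```python
def predicted_loss(N, N_c=8.8e13, alpha_N=0.076):
    """Power-law form L(N) = (N_c / N)**alpha_N relating test loss to
    non-embedding parameter count; constants here are illustrative only."""
    return (N_c / N) ** alpha_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```
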
GPT-3
OpenAI
Paper: Language Models are Few-Shot Learners
GPT-3 scaled transformers to 175 billion parameters, demonstrating that language models could perform diverse tasks with few-shot, one-shot, or zero-shot learning from prompts alone. It showed impressive performance on translation, question-answering, arithmetic, and novel word usage without gradient updates. GPT-3 revealed that with sufficient scale, language models develop broad capabilities and sparked widespread interest in large language models and prompt engineering.

ZeRO (Zero Redundancy Optimizer)
Microsoft
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
GitHub repo: DeepSpeed
ZeRO eliminated memory redundancies in data-parallel distributed training by partitioning optimizer states, gradients, and parameters across devices rather than replicating them. ZeRO enabled training models with trillions of parameters by dramatically reducing per-device memory requirements while maintaining computational efficiency. This optimization became crucial for training large language models and is implemented in DeepSpeed, enabling the scale of models like GPT-3 and beyond.

RoFormer: Rotary Position Embedding (RoPE)
Zhuiyi Technology
Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
GitHub repo: roformer
Rotary Position Embedding (RoPE) encodes position information by rotating word embeddings based on their absolute positions, while naturally encoding relative position information through the rotation properties. RoPE provided better extrapolation to longer sequences than previous position encoding methods while being computationally efficient. It was adopted by influential models including PaLM, LLaMA, and many other modern LLMs, becoming a preferred position encoding technique.
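A numpy sketch using the split-half ("rotate half") pairing convention common in open-source implementations; applying the same position-dependent rotation to queries and keys makes their dot products depend only on relative offsets:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate each pair of feature dimensions by an angle proportional to the
    token's position; query/key dot products then encode relative position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)    # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)      # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(8, 64))   # (positions, head_dim)
print(rotary_embed(q).shape)                        # (8, 64)
```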

LoRA: Low-Rank Adaptation of Large Language Models
Microsoft
Paper: LoRA: Low-Rank Adaptation of Large Language Models
GitHub repo: LoRA
LoRA enabled efficient fine-tuning of large language models by training low-rank decomposition matrices that are added to frozen pre-trained weights. This reduced trainable parameters by 10,000x and memory requirements by 3x while maintaining or exceeding full fine-tuning performance. LoRA made it practical to customize large models for specific tasks with limited compute resources, democratizing access to fine-tuning and enabling rapid adaptation of foundation models.
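A PyTorch sketch of a LoRA-wrapped linear layer (a hypothetical class, not the official implementation): the pre-trained weight is frozen and only the rank-r factors A and B are trained:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update:
    the effective weight is W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)         # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
x = torch.randn(4, 512)
print(layer(x).shape)                                              # torch.Size([4, 512])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```
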
InstructGPT
OpenAI
Paper: Training Language Models to Follow Instructions with Human Feedback
InstructGPT fine-tuned GPT-3 using supervised learning on human-written demonstrations followed by reinforcement learning from human feedback. Despite having over 100x fewer parameters, the 1.3B-parameter InstructGPT model's outputs were preferred over those of the 175B-parameter GPT-3. The model showed improvements in truthfulness, helpfulness, and reduced toxicity. InstructGPT demonstrated that alignment with human preferences through RLHF was crucial for making language models useful and safe, establishing the approach used in ChatGPT.

Chain-of-Thought Prompting
Google Research
Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting enabled language models to solve complex reasoning tasks by generating intermediate reasoning steps before arriving at final answers. Simply adding a few examples with reasoning chains dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. This technique revealed emergent reasoning capabilities in large models and demonstrated that prompting strategies could unlock latent abilities without additional training.

FlashAttention
Stanford University
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
GitHub repo: flash-attention
FlashAttention optimized the attention mechanism by accounting for GPU memory hierarchy, using tiling to reduce data movement between GPU memory levels. This IO-aware algorithm achieved exact attention with significantly reduced memory usage and 2-4x speedup compared to standard implementations. FlashAttention enabled training transformers with much longer context lengths and became widely adopted, fundamentally improving the efficiency of transformer models.
Constitutional AI: Harmlessness from AI Feedback
Anthropic
Paper: Constitutional AI: Harmlessness from AI Feedback
Constitutional AI introduced a method for training harmless AI assistants using AI-generated feedback based on a set of principles (a 'constitution') rather than relying solely on human feedback. The model critiques and revises its own responses according to constitutional principles, then learns from these self-improvements. This approach reduced reliance on human labelers for harmlessness training while making the values guiding AI behavior more transparent and debuggable.

Direct Preference Optimization (DPO)
Stanford University
Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
GitHub repo: direct-preference-optimization
DPO simplified preference learning by directly optimizing language models on human preferences without requiring a separate reward model or reinforcement learning. It reformulated RLHF as a classification problem over preference pairs, making training more stable and efficient. DPO achieved comparable or better results than RLHF while being simpler to implement and tune, becoming a popular alternative for aligning language models with human preferences.
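A sketch of the DPO loss on toy sequence log-probabilities (the beta temperature and the logistic form follow the paper; the helper itself is illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: a logistic loss on the difference of implicit rewards
    beta * (log pi / pi_ref) between the preferred and dispreferred
    responses, so no explicit reward model or RL loop is needed."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -20.0]),
    policy_rejected_logps=torch.tensor([-15.0, -18.0]),
    ref_chosen_logps=torch.tensor([-13.0, -19.0]),
    ref_rejected_logps=torch.tensor([-14.0, -19.0]),
)
print(loss)
```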

QLoRA: Efficient Fine-Tuning of Quantized LLMs
University of Washington
Paper: QLoRA: Efficient Finetuning of Quantized LLMs
GitHub repo: qlora
QLoRA combined quantization with LoRA to enable fine-tuning of extremely large models on consumer hardware. It quantized the base model to 4-bit precision while using LoRA adapters in higher precision, maintaining full fine-tuning performance. QLoRA made it possible to fine-tune a 65B parameter model on a single GPU with 48GB memory, dramatically democratizing access to fine-tuning large language models and enabling researchers with limited resources to customize state-of-the-art models.

Mixture-of-Experts (MoE) and Test-Time Compute Scaling
Mistral AI, xAI, DeepSeek
Paper: Mixtral of Experts
Modern Mixture-of-Experts architectures like Mixtral, Grok, and DeepSeek-V2 combined sparse routing with test-time compute scaling, allowing models to dynamically allocate computation based on task difficulty. These architectures activated only a subset of parameters per token while maintaining large total capacity, achieving better performance-per-compute ratios. The combination with test-time scaling, where models use more computation for harder problems, represented a shift toward more efficient and adaptive AI systems.

Layer Dropping and Progressive Pruning (TrimLLM)
Northeastern University, Indiana University Bloomington, University of Connecticut, University of Massachusetts Dartmouth, North Carolina State University
Paper: TrimLLM: Progressive Layer Dropping for Efficient LLM Inference
Layer dropping and progressive pruning techniques enabled efficient inference by selectively skipping or removing transformer layers based on input characteristics or layer importance. Research showed that many layers in large language models are redundant for certain tasks, and adaptive layer selection could maintain performance while reducing computation. These techniques became important for deploying large models in resource-constrained environments and improving inference efficiency.

Multimodal Secure Alignment
Carnegie Mellon University, University of Washington
Paper: Defending Against Jailbreak Attacks in Multimodal Language Models
Multimodal secure alignment addresses unique safety challenges when language models process multiple modalities (text, images, audio). Research revealed that multimodal models could be more vulnerable to jailbreaks through adversarial images or cross-modal attacks. New alignment techniques were developed to ensure consistent safety behavior across modalities, including modality-specific safety layers and cross-modal consistency checking. This work became critical as vision-language models like GPT-4V and Gemini became widely deployed.
Chain-of-Thought Monitorability
Google DeepMind, Anthropic
Paper: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
This work reframed chain-of-thought (CoT) safety monitoring around monitorability rather than faithfulness, distinguishing CoT-as-rationalization from CoT-as-computation. By making harmful behaviors require multi-step reasoning, the authors forced models to expose their plans and showed that CoT monitoring can detect severe risks unless attackers receive substantial assistance. The paper also offered stress-testing guidelines, concluding that CoT monitoring remains a valuable, if imperfect, layer of defense that warrants active protection and continual evaluation.
Critical Representation Fine-Tuning (CRFT)
Zhejiang University, Alibaba Cloud Computing
Paper: Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
CRFT extends Representation Fine-Tuning by identifying "critical" hidden representations that aggregate or gate reasoning signals, then editing them in a low-rank subspace while freezing the base LLaMA or Mistral weights. Information-flow analysis selects these high-leverage states, enabling large CoT accuracy gains across eight arithmetic and commonsense benchmarks as well as sizeable one-shot improvements. The work highlights that representation-level PEFT can unlock better reasoning without touching most model parameters.

Mechanistic OOCR Steering Vectors
Massachusetts Institute of Technology, Independent Researchers
Paper: Simple Mechanistic Explanations for Out-Of-Context Reasoning
This study dissects out-of-context reasoning (OOCR) and finds that many reported cases arise because LoRA fine-tuning effectively adds a constant steering vector that pushes models toward latent concepts. By extracting or directly training such steering vectors, the authors reproduce OOCR across risky/safe decision, function, location, and backdoor benchmarks, showing that unconditional steering can even implement conditional behaviors. The results provide a simple mechanistic account of why fine-tuned LLMs can generalize far beyond their training distribution and highlight the alignment implications of steering-vector interventions.

Continuous Thought Machines (CTM)
Sakana AI
Paper: Continuous Thought Machines
GitHub repo: ctm
The Continuous Thought Machine (CTM) introduces a neural network architecture that reintroduces neural timing as a foundational element by integrating neuron-level temporal processing and neural synchronization. Unlike standard networks that abstract away the dynamics of individual neurons, the CTM uses neural dynamics as its core representation through two key innovations: neuron-level temporal processing, where each neuron applies unique weight parameters to its history of incoming signals, and neural synchronization used directly as a latent representation. The CTM performs strongly across diverse tasks, including ImageNet-1K classification, 2D maze solving, sorting, parity computation, question answering, and reinforcement learning, and it naturally supports adaptive computation: it can stop early on simpler inputs or continue processing on harder ones.
Constitutional Classifiers++
Anthropic
Paper: Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks
Constitutional Classifiers++ delivers production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. The system combines exchange classifiers that evaluate model responses in full conversational context, a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers, and efficient linear probe classifiers ensembled with external classifiers. These techniques achieve a 40x computational cost reduction compared to baseline exchange classifiers while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, the work demonstrates strong protection against universal jailbreaks.