Mamba

Step 1 of 9

Why Another Recurrent Model in 2023

By late 2023 the Transformer had dominated sequence modeling for six years, but its quadratic cost in sequence length \( L \) meant that longer context windows came from systems work rather than architectural progress. A parallel line of research on structured state space models (S4, H3, Hyena) offered \( O(L) \) training and \( O(1) \) per-step inference by replacing self-attention with a linear recurrence over a learned continuous-time dynamical system. On long-range benchmarks like Path-X these models were state of the art; on language modeling they lagged Transformers of the same parameter count by a visible margin.

Mamba identifies the cause of that gap and fixes it. Prior SSMs are linear time-invariant (LTI): the same recurrence matrices apply at every step, so the model cannot decide to attend to a specific token or ignore a filler one. The paper introduces a selection mechanism that makes the step size \( \Delta \) and the input/output projections \( B, C \) functions of the current token, breaking time-invariance and letting the hidden state copy, skip, or erase content as needed. A hardware-aware parallel scan keeps the recurrence \( O(L) \) in wall-clock time despite the loss of convolutional structure. The result is Mamba: a Transformer-free language model that matches or exceeds Transformer++ models of twice its parameter count across scales from 125M to 2.8B, with 5× higher inference throughput (Section 4.2, Table 3 of arXiv:2312.00752).
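To make "functions of the current token" concrete, here is a minimal PyTorch sketch of the selection step, loosely following the shape of the paper's Algorithm 2: per-token \( B_t, C_t \) come from linear projections of the input, and \( \Delta_t \) from a low-rank projection passed through a softplus to keep it positive. The module and hyperparameter names (s_B, s_C, dt_rank, d_state) are illustrative, not the released implementation, and the recurrence that consumes these parameters is spelled out in the next section.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Sketch of the selection mechanism: Delta, B, C become functions of the
    input x instead of fixed parameters. Produces only the per-token parameters;
    the SSM recurrence that uses them is described in the next section.
    Names and sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, d_model: int, d_state: int = 16, dt_rank: int = 8):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)       # token-dependent input projection B_t
        self.s_C = nn.Linear(d_model, d_state)       # token-dependent output projection C_t
        self.s_Delta = nn.Linear(d_model, dt_rank)   # low-rank projection for the step size
        self.dt_proj = nn.Linear(dt_rank, d_model)   # expand back to one Delta per channel

    def forward(self, x):                            # x: (batch, L, d_model)
        B = self.s_B(x)                              # (batch, L, d_state)
        C = self.s_C(x)                              # (batch, L, d_state)
        delta = F.softplus(self.dt_proj(self.s_Delta(x)))  # (batch, L, d_model), positive
        return delta, B, C
```

Note that \( A \) itself stays an input-independent learned parameter in the paper; selectivity reaches it only through \( \Delta \).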

What an SSM actually is

Common misconception: a state space model is a new class of architecture distinct from recurrent networks. An SSM is a linear RNN — a discretized linear ODE — with structured (diagonal or diagonal-plus-low-rank) transition matrices and a principled continuous-time initialization. The novelty of the S4 line was not recurrence but that structure and initialization let a linear RNN train stably over sequences of tens of thousands of tokens, where an LSTM collapses.
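Concretely, the continuous-time system is \( h'(t) = A h(t) + B x(t),\; y(t) = C h(t) \). Discretizing with step size \( \Delta \) under a zero-order hold gives \( \bar{A} = \exp(\Delta A) \) and \( \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B \), and the linear recurrence \( h_t = \bar{A} h_{t-1} + \bar{B} x_t,\; y_t = C h_t \). The sketch below runs that recurrence for a single channel with a real diagonal \( A \) as a stand-in for the S4-style structured initialization; names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def lti_ssm_scan(x, A, B, C, delta):
    """Minimal single-channel LTI SSM: a linear RNN obtained by discretizing
    h'(t) = A h(t) + B x(t), y(t) = C h(t) with a fixed step delta.

    x:     (L,)  input sequence (one channel)
    A:     (N,)  diagonal continuous-time state matrix (stable: negative entries)
    B, C:  (N,)  input / output projections
    delta: float step size, the same at every position (time-invariant)
    """
    # Zero-order-hold discretization; elementwise because A is diagonal.
    A_bar = np.exp(delta * A)                 # (N,)
    B_bar = (A_bar - 1.0) / A * B             # (N,), = (exp(delta*A) - 1) / A * B for diagonal A
    h = np.zeros_like(A)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):               # O(L) sequential form of the recurrence
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

# Toy usage: 16-dimensional state, 100-step input.
rng = np.random.default_rng(0)
A = -np.arange(1.0, 17.0)                     # simple stable diagonal init (stand-in for HiPPO)
B = np.ones(16)
C = rng.normal(size=16)
y = lti_ssm_scan(rng.normal(size=100), A, B, C, delta=0.1)
```

Because \( \bar{A}, \bar{B}, C \) do not depend on \( t \), the same computation can be unrolled into a long convolution, which is what lets LTI SSMs train in parallel and what Mamba gives up when it makes the parameters token-dependent.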

Two synthetic tasks that break LTI SSMs

Section 3.1 constructs two toy tasks that isolate the limitation. Selective Copying asks the model to emit a subset of input tokens (content tokens scattered among filler tokens at random positions) in order, skipping the rest; Induction Heads requires recalling the token that followed a given pattern on an earlier occurrence. Both tasks demand content-aware behavior: the recurrence needs to act differently depending on what it just saw. An LTI recurrence cannot do this by construction: its transition and projection matrices are the same at every step, so the same compression rule applies to every token regardless of content. S4 and H3 fall far short of solving Selective Copying; Mamba reaches effectively perfect accuracy and extrapolates Induction Heads to sequences 4000× longer than those seen during training (Tables 1 and 2).
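To pin down what Selective Copying data looks like, here is one way to generate a batch; the vocabulary layout, filler convention, and sizes below are illustrative choices, not the paper's exact data pipeline.

```python
import numpy as np

def selective_copy_batch(batch=32, seq_len=64, n_targets=8, vocab=16, seed=0):
    """Illustrative Selective Copying data: content tokens appear at random positions,
    every other position holds a filler token, and the label is the content tokens in
    order of appearance. A content-aware model must remember content tokens and skip
    fillers; a fixed (LTI) recurrence processes every position with the same rule."""
    rng = np.random.default_rng(seed)
    FILLER = vocab                                   # reserve one id for the filler token
    inputs = np.full((batch, seq_len), FILLER, dtype=np.int64)
    targets = np.empty((batch, n_targets), dtype=np.int64)
    for i in range(batch):
        pos = np.sort(rng.choice(seq_len, size=n_targets, replace=False))  # random positions
        tok = rng.integers(0, vocab, size=n_targets)                        # content tokens
        inputs[i, pos] = tok
        targets[i] = tok                             # emit content tokens in order, skip fillers
    return inputs, targets

x, y = selective_copy_batch()
print(x.shape, y.shape)                              # (32, 64) (32, 8)
```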