The Transformer
The Bottleneck of Recurrence
Before the Transformer, the dominant architectures for sequence transduction — machine translation, summarization, parsing — were recurrent networks, most often LSTMs or GRUs wrapped in an encoder-decoder framework. A recurrent encoder consumes the source tokens one at a time, folding each new input \( x_t \) into a hidden state that summarizes everything seen so far:
Recurrent update:
\[
h_t = f(h_{t-1}, x_t)
\]
Here \( h_t \in \mathbb{R}^{d} \) is the hidden state at position \( t \), \( x_t \) is the input embedding, and \( f \) is the recurrent cell (an LSTM gate stack, a GRU, etc.). The state \( h_t \) cannot be computed until \( h_{t-1} \) is known. For a sequence of length \( n \), the encoder therefore performs \( n \) sequential steps, each dependent on the last.
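To make the sequential dependence concrete, here is a minimal sketch of a recurrent encoder in NumPy, with a plain tanh cell standing in for \( f \); the function name, dimensions, and weight initialization are illustrative, not taken from the paper:

```python
import numpy as np

def rnn_encode(x, W_h, W_x, b):
    """Toy recurrent encoder with a tanh cell standing in for f.

    x:   (n, d_in)  input embeddings, one row per token
    W_h: (d, d)     recurrent weights
    W_x: (d_in, d)  input weights
    b:   (d,)       bias
    Returns all hidden states, shape (n, d).
    """
    n, d = x.shape[0], W_h.shape[0]
    h = np.zeros(d)
    states = []
    for t in range(n):                          # n strictly sequential steps
        h = np.tanh(h @ W_h + x[t] @ W_x + b)   # h_t depends on h_{t-1}
        states.append(h)
    return np.stack(states)

# 128 tokens -> 128 dependent steps, however much hardware sits idle
rng = np.random.default_rng(0)
n, d_in, d = 128, 32, 64
out = rnn_encode(rng.normal(size=(n, d_in)),
                 rng.normal(size=(d, d)) * 0.1,
                 rng.normal(size=(d_in, d)) * 0.1,
                 np.zeros(d))
print(out.shape)  # (128, 64)
```

The loop over \( t \) is the point: each iteration reads the previous hidden state, so the iterations cannot be reordered or fused into a single batched operation along the sequence axis.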
Why RNNs were displaced
Common misconception: RNNs were replaced because they could not learn long-range dependencies. The vanishing-gradient problem is real, but gated cells such as the LSTM substantially mitigated it. The decisive issue in the GPU era is different: recurrence forbids parallelism along the sequence axis. A batch of 64 sentences of length 128 requires 128 sequential steps per training example regardless of how many tensor cores sit idle. The Transformer discards recurrence entirely so that every token position is computed in parallel within a layer.
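For contrast, here is a minimal NumPy sketch of scaled dot-product self-attention over the same length-128 sequence (the projection matrices and sizes are illustrative): all positions are produced by a few dense matrix multiplies, with no step-to-step dependence inside the layer.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: every position computed in parallel.

    x: (n, d) token representations; W_q, W_k, W_v: (d, d_k) projections.
    Returns (n, d_k). The whole sequence is handled by matrix products,
    so no position waits on any other within the layer.
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # each (n, d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d_k)

rng = np.random.default_rng(0)
n, d, d_k = 128, 64, 64
out = self_attention(rng.normal(size=(n, d)),
                     *(rng.normal(size=(d, d_k)) * 0.1 for _ in range(3)))
print(out.shape)  # (128, 64)
```

Every matrix product here maps cleanly onto batched GPU kernels, which is exactly the work that the sequential recurrent loop cannot expose.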
The paper's claim is sharp: a model built from self-attention, feed-forward layers, and residual connections — with no recurrence and no convolution — can match or beat the best recurrent translation systems of 2017 while training an order of magnitude faster. The remainder of this study unpacks each mechanical component, then the training recipe and the empirical results that backed the claim.