BERT

Step 1 of 8

The Bidirectionality Gap in 2018

By late 2018 the recipe for transferring pretrained representations to downstream NLP tasks had two dominant templates. ELMo (Peters et al., 2018) trained two independent language models — one left-to-right, one right-to-left — and concatenated their hidden states as contextual features fed into task-specific architectures. OpenAI GPT (Radford et al., 2018) pretrained a 12-layer Transformer decoder as a strict left-to-right language model and fine-tuned it end-to-end on each task. Both unlocked large gains over training from scratch, and OpenAI GPT held the state-of-the-art GLUE average of 75.1 heading into BERT's submission (Table 1, OpenAI GPT row).

BERT identifies a specific limitation in both approaches and claims it is the binding constraint. ELMo is only shallowly bidirectional: the two language models never condition on each other's context during pretraining, so a token's ELMo feature is the concatenation of two one-sided representations, not a jointly bidirectional one. OpenAI GPT is deeply contextual but strictly one-sided: at position \( i \) every layer attends only to tokens \( 1, \ldots, i \), so no layer of the pretrained model ever sees right-hand context. BERT's contribution is a single encoder Transformer in which every layer attends in both directions to the entire input, pretrained on a task that still has a well-defined, leak-free objective.
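
To make the contrast concrete, here is a minimal single-head sketch in PyTorch; the function name, toy dimensions, and random inputs are illustrative, not from the paper. The only difference between the GPT-style and BERT-style cases is whether an upper-triangular mask suppresses attention to the right of each position before the softmax.

```python
import torch

def attention_weights(q, k, causal: bool) -> torch.Tensor:
    """Single-head attention weights for one sequence; q, k: (seq_len, d)."""
    scores = q @ k.T / k.shape[-1] ** 0.5            # (seq_len, seq_len)
    if causal:
        # GPT-style: position i may attend only to positions 1..i,
        # so everything above the diagonal is pushed to -inf.
        seq_len = scores.shape[0]
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                            diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    # BERT-style is simply the unmasked case: every position in every
    # layer attends to the whole input in both directions.
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
q = k = torch.randn(5, 8)
print(attention_weights(q, k, causal=True))   # lower-triangular: left context only
print(attention_weights(q, k, causal=False))  # dense: both directions per layer
```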

Three pretraining architectures, compared

Three pretraining architectures. ELMo trains independent LTR and RTL language models and concatenates their hidden states. OpenAI GPT uses a causal-masked Transformer that sees only left context. BERT removes the causal mask and uses a bidirectional Transformer with a masked-token objective. Structural redraw of Figure 3 of arXiv:1810.04805.

"Shallowly bidirectional" is not a nitpick

Common misconception: ELMo is bidirectional, so BERT's bidirectionality is an incremental improvement. In ELMo the left-to-right and right-to-left language models are trained independently and combined only at the final representation layer; no layer of either LM conditions on the other LM's features. In BERT every self-attention sub-layer sees every token in the input on both sides at once. A token's representation at layer 12 has been jointly refined against its neighbors in both directions for all 12 steps of the stack, not merely concatenated with an independently computed feature from the opposite direction.
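
A toy contrast in PyTorch may help; the LSTM and Transformer modules and the sizes here are stand-ins, not the papers' actual configurations. In the ELMo-style path the two directions meet only in the final concatenation; in the BERT-style path every encoder layer mixes both directions before the next layer runs.

```python
import torch
import torch.nn as nn

d = 32
x = torch.randn(1, 10, d)  # (batch, seq_len, d_model), a toy embedded sentence

# ELMo-style shallow bidirectionality: two independent one-directional
# models; their hidden states are only concatenated at the end.
ltr = nn.LSTM(d, d, batch_first=True)
rtl = nn.LSTM(d, d, batch_first=True)
h_ltr, _ = ltr(x)                                  # left context only
h_rtl, _ = rtl(torch.flip(x, dims=[1]))            # right context only
shallow = torch.cat([h_ltr, torch.flip(h_rtl, dims=[1])], dim=-1)  # (1, 10, 2*d)

# BERT-style deep bidirectionality: one stack in which every self-attention
# layer sees both sides, so the top layer is jointly refined, not concatenated.
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
deep = nn.TransformerEncoder(layer, num_layers=2)(x)  # (1, 10, d)
```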

The obstacle a bidirectional LM has to avoid

There is a reason left-to-right was the default. A standard language-modeling loss predicts token \( x_i \) from tokens \( x_1, \ldots, x_{i-1} \); a deeply bidirectional model would see \( x_i \) itself in its own context on higher layers, so the prediction task becomes trivial — the token leaks from lower layers to higher ones through the attention stack. BERT's central engineering move is to change the task: mask a fraction of tokens at the input and predict the original values from the unmasked positions. With the masked tokens replaced by a placeholder, bidirectional attention is safe because the information to be predicted was removed before it could reach any layer.
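
A minimal sketch of that corruption step, assuming a toy vocabulary, a reserved MASK_ID, and a stand-in embedding-plus-linear scorer where a real bidirectional encoder would sit. The point is structural: the target token is removed from the input before any layer can attend to it, and the loss is computed only at the masked positions. (The paper's 80/10/10 refinement of the replacement rule belongs to the objectives discussion and is omitted here.)

```python
import torch
import torch.nn.functional as F

MASK_ID = 0                     # reserved placeholder id in this toy vocab
vocab_size, d = 100, 16

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Replace a random subset of tokens with [MASK]; keep originals as labels."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100      # cross_entropy ignores unmasked positions
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID # the true token never enters the network
    return corrupted, labels

embed = torch.nn.Embedding(vocab_size, d)
score = torch.nn.Linear(d, vocab_size)    # stand-in for a bidirectional encoder

tokens = torch.randint(1, vocab_size, (1, 12))
corrupted, labels = mask_tokens(tokens)
logits = score(embed(corrupted))          # (1, 12, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)
```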

The paper's headline claim

A single deeply bidirectional Transformer encoder, pretrained with a masked-language-model objective and a sentence-pair prediction objective, then fine-tuned end-to-end on each downstream task, advances the state of the art on eleven NLP benchmarks. On GLUE, BERT-LARGE reaches an average of 82.1, a 7.0-point absolute gain over OpenAI GPT's 75.1 (Table 1). On SQuAD 1.1 test, it reaches 93.2 F1 as an ensemble, surpassing the reported human performance of 91.2 (Table 2). The remainder of this study covers the architecture and input format (§2), the two pretraining objectives (§3–4), the training recipe (§5), the benchmark results (§6), and the ablations that isolate what actually mattered (§7).