Sparsely-Gated MoE

Step 1 of 8

The Capacity-Compute Tradeoff

A dense neural network pays for every parameter on every input. Doubling the width of a feed-forward layer doubles both its capacity and its compute per token. For a 2017-era language model trained on billions of words, this coupling was the binding constraint: training a larger model for the same wall-clock budget meant fewer steps over less data, which usually erased the capacity gain.
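To see the coupling concretely, take a feed-forward block with input/output width \( d \) and hidden width \( m \) (illustrative symbols, not the paper's notation), and count one multiply and one add per weight:

\[
\text{params} = 2dm, \qquad \text{FLOPs per token} \approx 4dm,
\]

so setting \( m \to 2m \) doubles capacity and per-token compute in lockstep. The layer cannot gain parameters without paying for them on every token.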

The Sparsely-Gated Mixture-of-Experts layer decouples the two. It contains up to tens of thousands of expert sub-networks, but each input token activates only a handful of them. Total parameter count grows linearly with the number of experts; compute per token stays bounded by \( k \), the number of experts actually evaluated.
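A back-of-the-envelope sketch of the decoupling. The layer sizes below are hypothetical, picked only to make the arithmetic visible; what comes from the paper is the pattern that total parameters scale with the expert count \( n \) while per-token compute scales with \( k \):

```python
# Hypothetical sizes for illustration; not the paper's configuration.
d_model, d_ff = 1024, 4096
params_per_expert = 2 * d_model * d_ff   # two weight matrices per expert FFN

k = 4                                    # experts evaluated per token
for n in (4, 1024, 65_536):              # total experts in the layer
    total = n * params_per_expert        # grows linearly with n
    active = k * params_per_expert       # fixed: per-token compute tracks k
    print(f"n={n:>6}: {total / 1e9:8.3f}B params total, "
          f"{active / 1e6:.1f}M active per token")
```

Running it shows total parameters climbing four orders of magnitude while the per-token figure never moves.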

MoE is not an ensemble

Common misconception: a mixture-of-experts layer averages the outputs of many models, so it is an ensemble with learned weights. In fact, the routing is sparse: for \( n = 65{,}536 \) experts with \( k = 4 \), any single token sees only 4 of them, about 0.006% of the layer, and the remaining 65,532 experts contribute exactly zero because they are never evaluated. An ensemble runs every member on every input; a sparse MoE runs almost none of its members on any given one.
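The arithmetic behind that fraction, as a quick check (the numbers are the source's; the script is just a calculator):

```python
n, k = 65_536, 4
print(f"active fraction per token: {k / n:.4%}")               # 0.0061%
print(f"ensemble evaluates {n // k}x more experts per input")  # 16384x
```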

Why prior conditional computation did not catch on

The idea of routing inputs through a subset of a network was not new. Earlier attempts used non-differentiable hard gates trained with REINFORCE-style gradient estimators, or stochastic binary gates whose high-variance updates made training unstable. Batched GPU hardware also erased the theoretical savings: naive implementations fanned the whole minibatch out to every branch, and implementations that did split the batch left each branch with too few examples to keep the hardware busy. The paper's contribution is a gate that is simultaneously sparse, differentiable through a softmax, and batched efficiently across data-parallel workers.
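The shape of that gate, as a minimal PyTorch sketch (an assumed re-implementation of the structure the paper describes, not its code): mask everything outside the top \( k \) logits to \( -\infty \) before the softmax, so the other experts get a gate value of exactly zero and receive no gradient. The paper's full gate also adds tunable Gaussian noise to the logits before the top-k selection; that part is omitted here.

```python
import torch

def top_k_gate(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Softmax restricted to the k largest logits per row.

    Entries outside the top k are masked to -inf, so their gate
    values are exactly 0 and gradients flow only through the k
    selected experts: sparse yet differentiable.
    """
    top_vals, top_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf")).scatter(-1, top_idx, top_vals)
    return torch.softmax(masked, dim=-1)

gates = top_k_gate(torch.randn(2, 8, requires_grad=True), k=2)
print((gates > 0).sum(dim=-1))  # tensor([2, 2]): exactly k nonzero gates per row
```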

The paper's claim

A single sparsely-gated MoE layer inserted into a recurrent language model can push model capacity from millions to tens of billions of parameters (up to 137 billion in the largest configuration; Table 8) while keeping per-token compute within a factor of two of the dense baseline, and it delivers large perplexity gains on 1-billion-word language modeling and BLEU gains on WMT'14 translation. The remainder of this study unpacks the gate, the auxiliary losses that make training stable, the architectural plumbing that makes the sparsity actually pay off on a GPU cluster, and the empirical numbers that backed the claim.