Transformer Variants
Beyond vanilla attention. Explore Mixture-of-Experts (MoE), linear attention, and state-space models like Mamba.
- Step 1: Attention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector (see the attention sketch after this list).
- Step 2: A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset, such as the top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
- Step 3: A sparsely-gated Mixture-of-Experts (MoE) layer routes each token to only a small subset of expert networks, so model capacity can grow much faster than per-token compute. Its central challenge is routing and load balancing: without auxiliary losses, a few experts tend to monopolize traffic (see the routing sketch after this list).
- Step 4: Linear attention is a family of attention mechanisms that rewrite or approximate softmax attention so sequence processing scales roughly linearly, rather than quadratically, with length. The benefit is efficiency on long contexts; the tradeoff is that exact softmax behavior is usually lost (see the linear-attention sketch after this list).
- Step 5: State-space models such as Mamba process sequences by evolving a learned hidden state through a recurrence rather than full quadratic attention. Their main appeal is linear-time sequence processing with strong long-context efficiency, especially when selective state updates let the model decide what to remember (see the recurrence sketch after this list).
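
A minimal NumPy sketch of Step 1's idea: scaled dot-product attention, where each query mixes the values according to its similarity to the keys. The shapes and the `attention` helper are illustrative, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weights come from query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_queries, n_keys) similarity matrix
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # context-dependent mix of the values

# Toy usage: 4 query positions attend over 6 key/value positions of width 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)  # -> (4, 8)
```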
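For Steps 2 and 3, a toy top-k routing sketch, not any particular MoE implementation; the expert and gate weight matrices are made-up stand-ins. It shows the two mechanics the cards describe: each token is dispatched to only its top-k experts, and counting tokens per expert exposes the load imbalance that auxiliary losses are meant to correct.

```python
import numpy as np

def moe_layer(x, expert_w, gate_w, top_k=2):
    """Sparse MoE forward pass: route each token to its top_k experts only.
    x: (n_tokens, d); expert_w: (n_experts, d, d); gate_w: (d, n_experts)."""
    logits = x @ gate_w                                # (n_tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of the chosen experts
    chosen = np.take_along_axis(logits, top, axis=-1)  # softmax over chosen logits only
    gates = np.exp(chosen - chosen.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # per-token dispatch (loop kept for clarity)
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * (x[t] @ expert_w[e])
    return out, top

# Load balancing: count how many tokens each expert received.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
expert_w = rng.normal(size=(4, 8, 8)) * 0.1
gate_w = rng.normal(size=(8, 4)) * 0.1
out, assignment = moe_layer(x, expert_w, gate_w)
print(np.bincount(assignment.ravel(), minlength=4))  # uneven counts motivate auxiliary losses
```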
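For Step 4, one common linear-attention formulation sketched under assumptions: a positive kernel feature map (elu-plus-one style) stands in for the softmax kernel, and the sequence length and widths are arbitrary. The point is the reassociated product: K^T V is formed once, so no n x n attention matrix is ever materialized and cost grows linearly with length.

```python
import numpy as np

def feature_map(x):
    """Positive feature map (elu(x) + 1) standing in for the softmax kernel."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linear attention: associativity lets us compute K^T V once (O(n*d^2))
    instead of an n x n attention matrix (O(n^2*d))."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                          # (d, d_v) summary of keys and values
    z = Kf.sum(axis=0)                     # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(1024, 64))    # long sequence, but no 1024 x 1024 matrix
print(linear_attention(Q, K, V).shape)     # -> (1024, 64)
```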
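For Step 5, a toy selective recurrence in the spirit of state-space models. This is not the actual Mamba parameterization (which discretizes continuous-time SSM parameters and uses a hardware-aware parallel scan); the gate, B, and C matrices here are assumed stand-ins. It does show the core idea from the card: one hidden state evolves step by step, cost is linear in sequence length, and an input-dependent gate decides how much past state to keep.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_ssm(x, W_gate, B, C):
    """Toy selective state-space scan: a recurrence over time (linear in length),
    where an input-dependent gate controls how much past state is retained.
    x: (seq_len, d); W_gate, B, C: (d, d) toy parameter matrices."""
    h = np.zeros(x.shape[1])
    ys = []
    for x_t in x:                            # one sequential pass, no attention matrix
        decay = sigmoid(x_t @ W_gate)        # selective: forgetting depends on the input
        h = decay * h + x_t @ B              # evolve the hidden state
        ys.append(h @ C)                     # read the output from the state
    return np.stack(ys)

rng = np.random.default_rng(0)
x = rng.normal(size=(2048, 16)) * 0.1
W_gate, B, C = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
print(selective_ssm(x, W_gate, B, C).shape)  # -> (2048, 16)
```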