Transformer Variants

Beyond vanilla attention. Explore Mixture-of-Experts (MoE), linear attention, and state-space models like Mamba.

Estimated time: ~90 min

  1. Step 1
    Attention computes a context-dependent weighted combination of values, where the weights come from similarities between queries and keys. It lets a model focus on the most relevant parts of an input instead of compressing everything into one fixed vector.
  2. Step 2
    A sparse mixture-of-experts layer replaces one dense feed-forward block with many expert subnetworks, but routes each token to only a small subset such as top-1 or top-2 experts. This increases parameter count and specialization without increasing per-token compute proportionally.
  3. Step 3
    Because a sparsely gated Mixture-of-Experts (MoE) layer activates only a small subset of expert networks for each token, model capacity can grow much faster than compute per token. Its central challenge is routing and load balancing: without a balancing mechanism such as an auxiliary loss, a few experts tend to monopolize traffic while the rest receive too few tokens to train well.
  4. Step 4
    Linear attention is a family of attention mechanisms that rewrite or approximate softmax attention so that the cost of processing a sequence scales roughly linearly with its length rather than quadratically. The benefit is efficiency on long contexts, but the tradeoff is that exact softmax behavior is usually lost.
  5. Step 5
    State space models such as Mamba process sequences by evolving a learned hidden state through a recurrence instead of computing full quadratic attention. Their main appeal is linear-time sequence processing with strong long-context efficiency, especially when selective state updates let the model decide what to remember.
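
Step 1's description of attention can be made concrete with a minimal sketch. It assumes scaled dot-product similarity, which is one common choice and not the only one; the function names and toy vectors are illustrative only.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors (assumed similarity)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted combination of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy data: the query matches the first key, so the first value dominates.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention(queries, keys, values)
```

Because the weights sum to 1, each output is a convex combination of the values; here the output leans toward the first value without ignoring the second.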
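
The top-k routing described in Step 2 can be sketched as follows. This is a toy illustration under stated assumptions: the router logits are hard-coded (in practice they would come from a learned layer), the experts are simple scaling functions, and renormalizing the gate weights over the selected experts is one common convention, not the only one.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize gates so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](token)  # the other experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Hypothetical experts: expert i multiplies its input by a fixed scale.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 1.5]  # hypothetical router output for this token

routes = top_k_route(router_logits, k=2)
output = moe_layer(token, router_logits, experts, k=2)
```

With four experts and k=2, only half of the expert parameters are exercised for this token, which is the sense in which parameter count grows without a proportional increase in per-token compute.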
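
Step 3 mentions auxiliary losses for load balancing. One widely used formulation, in the style of the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. The sketch below assumes top-1 routing and omits the scaling coefficient that would weight this term in a real training loss; the function name and toy logits are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_logits):
    """N * sum_i f_i * P_i, with f_i the fraction of tokens routed (top-1) to
    expert i and P_i the mean router probability assigned to expert i."""
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        probs = softmax(logits)
        winner = max(range(n_experts), key=lambda i: probs[i])
        counts[winner] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / n_tokens for c in counts]
    p_mean = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p_mean))

# Tokens split evenly across two experts versus all tokens sent to expert 0.
balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0]] * 4

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
```

The loss is lowest when traffic is spread evenly and rises as one expert monopolizes tokens, so adding it to the training objective pushes the router away from collapse.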
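
For Step 4, one way to see where the linear scaling comes from is the kernel-feature-map variant of linear attention. The sketch below assumes the feature map `phi(x) = elu(x) + 1`, which is one published choice among several, and uses the non-causal form for brevity. It accumulates a key-value summary once, then reuses it for every query, and compares the result against the equivalent quadratic computation.

```python
import math

def phi(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention(queries, keys, values):
    """Non-causal linear attention: one pass over keys, one pass over queries."""
    d, dv = len(keys[0]), len(values[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                       # running sum of phi(k), for normalization
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = phi(q)
        denom = dot(fq, z)
        outputs.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                        for b in range(dv)])
    return outputs

def quadratic_reference(queries, keys, values):
    """Same weights computed pairwise, costing one similarity per (query, key)."""
    outputs = []
    for q in queries:
        fq = phi(q)
        w = [dot(fq, phi(k)) for k in keys]
        total = sum(w)
        outputs.append([sum(wi * v[b] for wi, v in zip(w, values)) / total
                        for b in range(len(values[0]))])
    return outputs

queries = [[0.5, -1.0], [2.0, 0.3]]
keys = [[1.0, 0.0], [-0.5, 0.7], [0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Both functions produce the same outputs, but neither matches softmax attention exactly: the similarity is `phi(q) · phi(k)` rather than `exp(q · k)`, which is the tradeoff the step describes.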
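
The recurrence in Step 5 can be illustrated with a deliberately tiny, scalar-state toy. This is not Mamba's actual parameterization: it assumes a single scalar state, fixed hypothetical parameters (`A`, `B`, `C`, `w_delta`), and makes only the step size depend on the input, which is enough to show the idea of a selective update computed in one linear pass.

```python
import math

def softplus(x):
    # Smooth positive function, used here to keep the step size above zero.
    return math.log1p(math.exp(x))

def selective_scan(xs, A=-1.0, B=1.0, C=1.0, w_delta=1.0):
    """Toy selective recurrence: h_t = a_t * h_{t-1} + b_t * x_t, y_t = C * h_t."""
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size (the selective part)
        a_bar = math.exp(delta * A)    # decay factor in (0, 1) because A < 0
        b_bar = delta * B              # simple Euler-style discretization of B
        h = a_bar * h + b_bar * x      # constant work per step, no lookback
        ys.append(C * h)
    return ys

# An impulse followed by zeros: the state is written once, then decays.
ys = selective_scan([1.0, 0.0, 0.0])
```

Each step touches only the current input and the previous state, so the cost grows linearly with sequence length. A larger input yields a larger `delta`, which both forgets more of the old state and writes more of the new input.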