
LLM Systems

The engineering behind large models. Scaling laws, tokenization, KV cache optimization, and high-performance serving.

Estimated time: ~120 min

  1. Step 1
    Scaling laws are empirical relationships showing how loss or capability changes with model size, data, and compute, often following approximate power laws. They matter because they let researchers forecast returns to scale and choose more compute-efficient training regimes.
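    As a rough sketch of the parametric form these fits typically use (the coefficients below are illustrative placeholders, not values from any published fit):

    ```python
    # Chinchilla-style parametric loss: a power law in parameters N and tokens D.
    # E, A, B, alpha, beta are made-up placeholders, not fitted values.
    def predicted_loss(n_params: float, n_tokens: float,
                       E: float = 1.7, A: float = 400.0, B: float = 4000.0,
                       alpha: float = 0.34, beta: float = 0.28) -> float:
        """L(N, D) = E + A / N**alpha + B / D**beta."""
        return E + A / n_params**alpha + B / n_tokens**beta

    # Doubling data at a fixed model size only shrinks the D-dependent term.
    print(predicted_loss(7e9, 1.0e12))
    print(predicted_loss(7e9, 2.0e12))
    ```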
  2. Step 2
    Chinchilla scaling laws showed that, for a fixed compute budget, many large language models were undertrained for their size. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, which gives better performance for the same compute.
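    As a back-of-the-envelope check, the commonly cited Chinchilla heuristics are roughly 20 training tokens per parameter and a training cost of about 6·N·D FLOPs:

    ```python
    # Back-of-the-envelope Chinchilla-style budget check.
    # Heuristics: compute-optimal tokens ~ 20 x parameters, training FLOPs ~ 6 * N * D.
    def chinchilla_optimal_tokens(n_params: float) -> float:
        return 20.0 * n_params

    def training_flops(n_params: float, n_tokens: float) -> float:
        return 6.0 * n_params * n_tokens

    n = 70e9                                   # a 70B-parameter model
    d = chinchilla_optimal_tokens(n)           # ~1.4e12 tokens
    print(f"tokens ~ {d:.2e}, FLOPs ~ {training_flops(n, d):.2e}")
    ```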
  3. Step 3
    KV-cache eviction methods shrink a long-context KV cache at inference time by dropping tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot compression at prefill time.
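    A minimal NumPy sketch of score-based eviction in the spirit of H2O (the budget and recency-window sizes are arbitrary, and real implementations work per head on the running cache):

    ```python
    import numpy as np

    def select_kv_to_keep(attn: np.ndarray, budget: int, recent: int) -> np.ndarray:
        """attn: (num_queries, seq_len) attention weights over past tokens.
        Keep the `budget` tokens with the highest cumulative attention,
        always including the most recent `recent` tokens."""
        seq_len = attn.shape[1]
        scores = attn.sum(axis=0)                       # cumulative attention per token
        scores[max(0, seq_len - recent):] = np.inf      # pin the recent window
        keep = np.argsort(scores)[-budget:]             # heavy hitters + recent window
        return np.sort(keep)

    attn = np.random.dirichlet(np.ones(128), size=16)   # toy attention maps (16 queries)
    print(select_kv_to_keep(attn, budget=32, recent=8))
    ```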
  4. Step 4
    There are two main strategies for allocating an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case sequence length per request and wastes memory. Paged allocation (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× larger batch sizes at the cost of kernel-level bookkeeping.
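    A toy sketch of the paged idea (the block size and dict-based bookkeeping are illustrative, not vLLM's actual data structures): each sequence keeps a page table of small fixed-size blocks instead of one worst-case contiguous slab.

    ```python
    class PagedKVAllocator:
        """Toy page-table allocator in the spirit of PagedAttention (not vLLM's API)."""
        def __init__(self, num_blocks: int, block_size: int = 16):
            self.block_size = block_size          # tokens per KV block
            self.free_blocks = list(range(num_blocks))
            self.page_tables = {}                 # seq_id -> list of physical block ids

        def append_token(self, seq_id: str, pos: int) -> int:
            """Map logical token position `pos` to a physical cache slot,
            allocating a new block only at block boundaries."""
            table = self.page_tables.setdefault(seq_id, [])
            if pos % self.block_size == 0:
                table.append(self.free_blocks.pop())
            block = table[pos // self.block_size]
            return block * self.block_size + pos % self.block_size

        def free(self, seq_id: str) -> None:
            self.free_blocks.extend(self.page_tables.pop(seq_id, []))

    alloc = PagedKVAllocator(num_blocks=1024)
    slots = [alloc.append_token("req-0", pos) for pos in range(40)]
    print(len(alloc.page_tables["req-0"]), "blocks for 40 tokens")  # 3 blocks, not a worst-case slab
    ```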
  5. Step 5
    Store the KV cache at lower precision (int8 or int4) instead of fp16, halving or quartering the memory footprint of long contexts at negligible quality cost. The main tricks are using different quantisation for keys and values (K usually int8, V int4 via grouping) and per-head asymmetric scales.
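    A minimal sketch of per-head asymmetric 8-bit quantisation for one cache tensor (the exact bit-widths and grouping for K vs. V vary by method; this only shows the scale/zero-point mechanics):

    ```python
    import numpy as np

    def quantize_per_head_8bit(x: np.ndarray):
        """x: (num_heads, seq_len, head_dim) cache tensor in fp32/fp16.
        Returns uint8 codes plus a per-head asymmetric scale and zero-point."""
        lo = x.min(axis=(1, 2), keepdims=True)
        hi = x.max(axis=(1, 2), keepdims=True)
        scale = (hi - lo) / 255.0
        q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
        return q, scale, lo

    def dequantize(q, scale, zero):
        return q.astype(np.float32) * scale + zero

    k = np.random.randn(8, 1024, 64).astype(np.float32)   # toy key cache
    q, s, z = quantize_per_head_8bit(k)
    print("max abs error:", float(np.abs(dequantize(q, s, z) - k).max()))
    ```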
  6. Step 6
    Disaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
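    A highly simplified sketch of the handoff (the names and in-process calls are invented for illustration; real systems ship the cache between GPU pools over NVLink or RDMA):

    ```python
    from dataclasses import dataclass

    @dataclass
    class Handoff:
        request_id: str
        kv_cache: bytes                    # serialized cache sent from the prefill pool to the decode pool

    def prefill_pool(request_id: str, prompt: str) -> Handoff:
        """Throughput-bound stage: run the whole prompt once to build the KV cache."""
        kv = f"kv({prompt})".encode()      # stand-in for real prefill compute
        return Handoff(request_id, kv)

    def decode_pool(h: Handoff, max_new_tokens: int) -> list:
        """Latency-bound stage: consume the transferred cache, emit tokens one by one."""
        return [f"tok{i}" for i in range(max_new_tokens)]

    h = prefill_pool("req-42", "Explain KV caches.")   # runs on the prefill GPU pool
    print(decode_pool(h, max_new_tokens=4))            # runs on the decode GPU pool
    ```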