LLM Systems
The engineering behind large models. Scaling laws, tokenization, KV cache optimization, and high-performance serving.
- Step 1: Scaling laws are empirical relationships showing how loss or capability changes with model size, data, and compute, often following approximate power laws. They matter because they let researchers forecast returns to scale and choose more compute-efficient training regimes.
- Step 2: Chinchilla scaling laws showed that many large language models were undertrained for their size under fixed compute budgets. The central prescription is to train smaller models on more tokens than the earlier parameter-heavy frontier, yielding better compute-optimal performance.
- Step 3: Inference-time methods shrink a long-context KV cache by evicting tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot compression at prefill time.
- Step 4: Two allocation strategies exist for an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case length per request and wastes memory. Paged allocation (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× larger batch sizes at the cost of kernel-level bookkeeping.
- Step 5: Store the KV cache at lower precision, int8 or int4, instead of fp16. This halves or quarters the memory footprint of long contexts at negligible quality cost. The main tricks are using different precision for keys and values (K usually int8, V int4 via grouping) and per-head asymmetric scales.
- Step 6: Disaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
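The power-law picture in Steps 1–2 can be made concrete. A minimal Python sketch, using the parametric loss fit and constants reported by Hoffmann et al. (2022) (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28), the common C ≈ 6ND FLOP approximation, and the rounded ~20 tokens-per-parameter heuristic (a rule of thumb, not an exact prescription):

```python
import math

# Fitted Chinchilla constants (Hoffmann et al. 2022); treat as approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, tokens_per_param=20.0):
    """Compute-optimal (N, D) under C ~= 6*N*D and ~20 tokens per parameter."""
    N = math.sqrt(C / (6.0 * tokens_per_param))
    return N, tokens_per_param * N

# Chinchilla's budget (~5.76e23 FLOPs) recovers roughly 70B params, 1.4T tokens.
N, D = compute_optimal(5.76e23)
print(f"N ~= {N:.2e} params, D ~= {D:.2e} tokens, loss ~= {loss(N, D):.3f}")
```

Note how the same budget spent on a much larger, shorter-trained model gives a worse predicted loss, which is exactly the Step 2 prescription.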
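The eviction idea from Step 3 can be sketched in a few lines. This is a toy illustration of H2O-style scoring, not the paper's exact algorithm: the function name and the explicit recency window are my assumptions; the core idea, keeping the keys with the largest cumulative attention mass, is from the source:

```python
def h2o_keep_indices(attn, budget, recent=4):
    """Toy H2O-style KV eviction (names and recency window are illustrative).

    attn: per-query attention rows observed so far, each a list of
    seq_len weights. Keeps the `recent` newest positions plus the
    highest cumulative-attention "heavy hitters", up to `budget` entries.
    """
    seq_len = len(attn[0])
    # Cumulative attention each key position has received across queries.
    scores = [sum(row[j] for row in attn) for j in range(seq_len)]
    keep = set(range(max(0, seq_len - recent), seq_len))  # recency window
    for j in sorted(range(seq_len), key=lambda i: scores[i], reverse=True):
        if len(keep) >= budget:
            break
        keep.add(j)                                       # heavy hitter
    return sorted(keep)
```

SnapKV's observation maps onto the same skeleton: compute `scores` once from the last few prompt queries at prefill time and compress in one shot, rather than updating scores during decoding.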
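A toy sketch of the paged strategy from Step 4 (class and method names are illustrative, not vLLM's actual API): a fixed pool of pages and a per-request page table that grows one page at a time, so no request reserves its worst-case length up front:

```python
class PagedKVAllocator:
    """Toy PagedAttention-style allocator (all names are illustrative).

    KV memory is carved into fixed-size pages; each request holds a page
    table (a list of page ids) that grows on demand, like OS virtual
    memory, instead of a worst-case contiguous reservation.
    """
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free = list(range(num_pages))   # pool of free page ids
        self.tables = {}                     # request id -> [page ids]
        self.lengths = {}                    # request id -> tokens written

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:          # current page full: map a new one
            if not self.free:
                raise MemoryError("out of KV pages")
            table.append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        # Finished requests return their pages to the pool immediately.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because pages are returned as soon as a request finishes, many more requests can be in flight for the same KV budget, which is where the batch-size gain comes from.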
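The per-head asymmetric scales mentioned in Step 5 can be sketched for a single head. A minimal illustration of asymmetric int8 quantization in plain Python (function names are mine; real kernels operate on packed tensors, and int4 grouping follows the same recipe with a smaller range and per-group scales):

```python
def quantize_head_int8(values):
    """Asymmetric int8 quantization of one head's KV values (a toy sketch).

    Maps floats into [0, 255] with a per-head scale and zero point, so
    each value ~= (q - zero) * scale after dequantization.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0                # guard a constant head
    zero = round(-lo / scale)                       # integer zero point
    q = [max(0, min(255, round(v / scale) + zero)) for v in values]
    return q, scale, zero

def dequantize_head_int8(q, scale, zero):
    """Recover approximate fp values from the quantized representation."""
    return [(x - zero) * scale for x in q]
```

The reconstruction error is bounded by roughly one quantization step per value, which is why the quality cost stays negligible for int8 keys while values tolerate the coarser int4 grid when grouped.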
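The split in Step 6 can be caricatured in a few lines. Everything here is a stand-in: the class, the queues, and the list-based "KV cache" are purely illustrative, showing only the handoff pattern, not a real serving system:

```python
from collections import deque

class DisaggregatedServer:
    """Toy model of prefill/decode disaggregation (names are illustrative).

    A prefill pool processes whole prompts (throughput-bound), then hands
    each request's KV cache to a decode pool that extends it one token
    per step (latency-bound). Here the "KV cache" is just a list.
    """
    def __init__(self):
        self.prefill_q = deque()          # waiting prompts
        self.decode_q = deque()           # requests actively decoding
        self.kv = {}                      # req_id -> KV cache stand-in

    def submit(self, req_id, prompt_tokens):
        self.prefill_q.append((req_id, prompt_tokens))

    def prefill_step(self):
        # Process one full prompt, then transfer its KV to the decode pool.
        if self.prefill_q:
            req_id, prompt = self.prefill_q.popleft()
            self.kv[req_id] = list(prompt)
            self.decode_q.append(req_id)

    def decode_step(self):
        # One decode step appends one generated token per active request.
        for req_id in self.decode_q:
            self.kv[req_id].append("<tok>")
```

Because the two pools never compete for the same GPUs, a long prefill burst cannot stall in-flight decodes; the price is the KV transfer between pools, which real systems overlap with computation.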