Tag: interpretability
5 topic(s)
- Mechanistic OOCR Steering Vectors: A proposed explanation for some out-of-context reasoning (OOCR) results, in which fine-tuning acts like adding an approximately constant steering direction to the residual stream rather than learning a deeply conditional new algorithm. That helps explain why a tuned behavior can generalize far beyond the fine-tuning data and why injecting or subtracting the vector can often reproduce or remove it; see the steering-vector sketch after this list.
- Induction Heads: A two-head attention circuit in Transformers that copies the next token after a previous occurrence of the current token, a core computational basis for in-context learning. Anthropic showed that induction heads form suddenly during training, coinciding with the sharp jump in ICL ability; see the induction-pattern sketch after this list.
- Circuit Analysis: The mechanistic-interpretability practice of identifying subgraphs of weights, residual-stream components, and attention heads that jointly implement a human-interpretable algorithm (indirect object identification, modular addition, greater-than). Circuit analysis produces falsifiable, causal accounts of what a network has learned; see the activation-patching sketch after this list.
- ROME / MEMIT Model Editing: Rank-one edits to MLP weights that inject a single fact (ROME) or thousands of facts (MEMIT) into a pretrained LLM without retraining. They exploit the observation that MLP blocks act as key-value memories, locate the causally responsible mid-layer MLP sites via causal tracing (an activation-patching technique), and solve a closed-form optimization problem for the minimal-norm weight update; see the rank-one-edit sketch after this list.
- In-Context Learning Mechanisms: In-context learning (ICL) is the empirical phenomenon that a frozen LLM solves new tasks from few-shot examples in the prompt. Mechanistic studies show ICL is implemented by a small set of attention circuits (induction heads, function vectors, and implicit gradient-descent-like updates) that emerge during pretraining once the data and depth budget cross a threshold; see the implicit-gradient-descent sketch after this list.
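
Steering-vector sketch. A minimal PyTorch illustration of the OOCR steering-vector hypothesis, assuming the direction has already been extracted (here it is simply random); the `nn.Linear` stand-in, the layer choice, and the strength `alpha` are illustrative, not any particular paper's setup.

```python
import torch
import torch.nn as nn

d_model = 16

# Stand-in for one transformer block; in a real experiment this would be the
# module whose output feeds the residual stream at the chosen layer.
block = nn.Linear(d_model, d_model)

# Assumed precomputed steering direction, e.g. a mean activation difference
# between the fine-tuned and base model on matched prompts (random here).
steering_vec = torch.randn(d_model)
alpha = 4.0  # injection strength; a negative alpha would subtract the direction

def add_steering(module, inputs, output):
    # Add the same vector at every position: the hypothesized "constant shift"
    # that fine-tuning is claimed to approximate.
    return output + alpha * steering_vec

resid = torch.randn(2, 5, d_model)           # (batch, seq, d_model)

handle = block.register_forward_hook(add_steering)
steered = block(resid)                       # output shifted by alpha * steering_vec
handle.remove()

baseline = block(resid)
# The steered run differs from the baseline by exactly the injected vector.
print(torch.allclose(steered - baseline, alpha * steering_vec.expand_as(steered)))
```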
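Induction-pattern sketch. The input/output behavior attributed to induction heads ([A][B] ... [A] predicts [B]) written as plain index logic; a real induction circuit implements this with a previous-token head feeding a prefix-matching head, which this toy does not model.

```python
def induction_predict(tokens):
    """For each position, predict the token that followed the most recent
    earlier occurrence of the current token (None if there is none)."""
    preds = []
    for t in range(len(tokens)):
        pred = None
        for s in range(t - 1, -1, -1):        # scan backwards for a match
            if tokens[s] == tokens[t] and s + 1 < len(tokens):
                pred = tokens[s + 1]          # copy the token that followed it
                break
        preds.append(pred)
    return preds

seq = "the cat sat on the".split()
print(induction_predict(seq))   # [None, None, None, None, 'cat']: repeats "the cat"
```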
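Activation-patching sketch. The core causal experiment behind circuit analysis: cache an activation from a clean run, splice it into a corrupted run, and check how much of the clean behavior returns. The two-layer `nn.Sequential` model and the choice of patched layer are stand-ins; in a transformer only part of the computation routes through the patched component, so the restoration is usually partial rather than total as it is here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

clean, corrupted = torch.randn(8), torch.randn(8)

# 1) Cache the clean activation at the candidate circuit component (layer 0 here).
cache = {}
def save(module, inputs, output):
    cache["act"] = output.detach()

handle = model[0].register_forward_hook(save)
clean_logits = model(clean)
handle.remove()

# 2) Rerun on the corrupted input, but patch in the cached clean activation.
def patch(module, inputs, output):
    return cache["act"]

handle = model[0].register_forward_hook(patch)
patched_logits = model(corrupted)
handle.remove()

corrupted_logits = model(corrupted)

# 3) If patching this component restores the clean output, it is causally
#    implicated in the behavior under study.
print(clean_logits, corrupted_logits, patched_logits)
```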
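Rank-one edit sketch. A numpy toy of the ROME-style closed-form update, treating an MLP weight matrix W as a linear key-to-value memory. The key statistics C, the edited key k*, and the target value v* are random stand-ins rather than activations from a real model, and the actual method also optimizes v* against the model's output; only the algebra of the minimal-norm rank-one update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 6, 4
W = rng.normal(size=(d_v, d_k))           # existing memory: v = W k
K = rng.normal(size=(d_k, 1000))          # sample of keys to estimate statistics
C = K @ K.T / K.shape[1]                  # second-moment matrix of the keys

k_star = rng.normal(size=(d_k, 1))        # key for the edited fact (subject repr.)
v_star = rng.normal(size=(d_v, 1))        # desired value (new object repr.)

# Closed-form minimal-norm update that maps k* -> v* while approximately
# preserving the responses to other keys seen during pretraining.
u = np.linalg.solve(C, k_star)            # C^{-1} k*
Lambda = (v_star - W @ k_star) / (u.T @ k_star)
W_new = W + Lambda @ u.T                  # the rank-one edit

print(np.allclose(W_new @ k_star, v_star))   # True: the fact is injected
print(np.linalg.matrix_rank(W_new - W))      # 1: the change is rank one
```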
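Implicit-gradient-descent sketch. One concrete version of the "ICL as learned optimization" claim: for in-context linear regression, a single gradient step from zero weights gives the same query prediction as one unnormalized linear-attention readout over the examples. This follows the spirit of von Oswald et al.'s result, not their exact construction; the data, learning rate, and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))               # in-context example inputs
y = X @ w_true                            # in-context example targets
x_q = rng.normal(size=d)                  # query input
lr = 0.1

# One gradient step on the in-context loss 0.5 * ||X w - y||^2 from w = 0.
w_one_step = lr * X.T @ y
pred_gd = x_q @ w_one_step

# Same prediction from a linear attention head: query = x_q, keys = X, values = y.
attn_scores = X @ x_q                     # unnormalized dot-product attention
pred_attn = lr * attn_scores @ y

print(np.isclose(pred_gd, pred_attn))     # True: the two computations coincide
```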