
Alignment & Safety

Ensuring models behave predictably. RLHF, DPO, reward modeling, and the frontiers of mechanistic interpretability.

Estimated time: ~150 min

  1. Step 1
    RLHF aligns a model by collecting human preference data, training a reward model on those comparisons, and then optimizing the policy to maximize that reward while staying close to a reference model. It improves helpfulness and instruction following, but it can also introduce reward hacking and training instability (the underlying objective is written out in the first sketch after this list).
  2. Step 2
    DPO learns directly from preference pairs by pushing chosen responses to be more likely than rejected ones, relative to a reference model, without training a separate reward model or running an RL loop. It can be derived from the same KL-constrained reward-maximization objective as RLHF, which is why it is often presented as a simpler alternative to PPO-based RLHF (the loss is sketched after this list).
  3. Step 3
    Iterative DPO repeats a simple loop: sample responses from the current policy, score or label them into preference pairs, and apply a DPO-style update. It brings online data collection to preference optimization while keeping DPO's simpler training dynamics compared with PPO-based RLHF (the loop is sketched after this list).
  4. Step 4
    The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the reward or objective (a per-token version is sketched after this list). This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
  5. Step 5
    Because a learned reward model is only a proxy for human preference, RL optimization finds outputs that score highly without matching the true objective: verbose apologies, sycophancy, confident wrong answers, and format exploitation. This is Goodhart's law in practice. Mitigations range from the KL penalty and reward normalization (sketched after this list) to process supervision and debate.
  6. Step 6
    Hallucination occurs when a model produces content that is unsupported or false while presenting it as correct. In language models it usually stems from next-token training, weak grounding, or overconfident decoding rather than deliberate deception.
  7. Step 7
    Safety alignment is the process of making a model reliably avoid harmful, deceptive, or policy-violating behavior while remaining useful. In practice it combines data curation, supervised tuning, preference optimization or RLHF, classifiers, and adversarial evaluation, but it never guarantees perfect safety.
  8. Step 8
    Interpretability is the study of making model behavior understandable to humans, whether by explaining predictions, revealing learned features, or analyzing internal structure. It matters because debugging, trust, scientific understanding, and safety all depend on seeing more than just inputs and outputs.
  9. Step 9
    Mechanistic interpretability treats a neural network as a system to be reverse-engineered into circuits, features, and algorithms. Its goal is not just to correlate neurons with concepts, but to identify the actual internal computations that produce behavior, often by intervening on components and measuring the effect on outputs (one such ablation test is sketched after this list).
  10. Step 10
    In mechanistic interpretability, a sparse autoencoder is trained on model activations to decompose dense, superposed representations into a larger set of sparse features. This often makes latent structure more interpretable, because individual learned directions can line up with human-readable concepts or behaviors (a minimal architecture is sketched below).
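
Sketch for Step 1 (RLHF objective). A compact way to state the optimization is the KL-regularized objective below. The notation is my own: r_phi is the learned reward model, pi_ref the frozen reference policy, and beta the strength of the KL constraint.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
```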
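
Sketch for Step 2 (DPO loss). A minimal PyTorch version, assuming the per-token log-probabilities of each (prompt, response) pair have already been summed under the policy and under the frozen reference model; the function and argument names are illustrative, not a fixed library API.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities, one entry per
    (prompt, response) pair, under the policy or the frozen reference model.
    """
    # Log-ratios of policy vs. reference for chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push the chosen log-ratio above the rejected one, scaled by beta.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

Here beta plays the same role as the KL coefficient in the RLHF objective above: larger values keep the policy closer to the reference model.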
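
Sketch for Step 3 (iterative DPO). A schematic of the loop; the callables passed in (generate, label_preferences, dpo_update) are hypothetical stand-ins for sampling from the current policy, turning samples into (chosen, rejected) pairs via humans, a reward model, or an LLM judge, and one optimization pass with a DPO-style loss.

```python
def iterative_dpo(policy, ref_model, prompts, generate, label_preferences,
                  dpo_update, num_rounds=3, samples_per_prompt=4,
                  refresh_reference=False):
    """Online preference optimization as a repeated sample/label/update loop."""
    for _ in range(num_rounds):
        # 1. Sample fresh responses from the *current* policy (online data).
        responses = {p: generate(policy, p, n=samples_per_prompt) for p in prompts}
        # 2. Turn the samples into (chosen, rejected) preference pairs.
        pairs = label_preferences(responses)
        # 3. Apply a DPO-style update against the reference model.
        policy = dpo_update(policy, ref_model, pairs)
        # 4. Some recipes also re-anchor the reference to the latest policy.
        if refresh_reference:
            ref_model = policy
    return policy
```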
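
Sketch for Step 4 (KL penalty). A common implementation folds an approximate per-token KL into the reward before the policy-gradient step; a minimal version under that assumption, with illustrative names.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, sequence_reward, beta=0.02):
    """Per-token rewards that include a KL penalty toward the reference model.

    policy_logprobs, ref_logprobs: log-probabilities of the sampled tokens
    under the policy and the frozen reference model, shape (seq_len,).
    sequence_reward: scalar reward-model score for the complete response.
    """
    # Approximate the per-token KL by the log-prob gap on the sampled tokens.
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token               # penalize drift at every token
    rewards[-1] = rewards[-1] + sequence_reward  # add the score at the final token
    return rewards
```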
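
Sketch for Step 5 (reward normalization). Among the mitigations listed, normalization is the simplest to show: whiten reward-model scores within each batch so the policy update depends on relative rather than absolute scores. A minimal sketch:

```python
def normalize_rewards(raw_rewards, eps=1e-8):
    """Whiten a batch of reward-model scores to mean 0 and standard deviation 1.

    Working with relative scores keeps the update scale stable and reduces
    sensitivity to the absolute scale of reward-model outputs.
    """
    return (raw_rewards - raw_rewards.mean()) / (raw_rewards.std() + eps)
```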
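
Sketch for Step 9 (causal intervention). One standard way to go beyond correlating neurons with concepts is to ablate a component and measure the effect on a behavior of interest. A minimal PyTorch sketch, assuming the hooked module returns a single tensor; model, layer_module, and target_logit_fn are illustrative placeholders.

```python
import torch

def zero_ablation_effect(model, inputs, layer_module, target_logit_fn):
    """Measure how much one component contributes to a chosen behavior.

    Zero-ablates the output of `layer_module` with a forward hook and reports
    the change in a behavior metric (e.g. the logit of a particular answer).
    """
    with torch.no_grad():
        baseline = target_logit_fn(model(inputs))

    def hook(module, hook_inputs, output):
        # Returning a tensor from a forward hook replaces the module's output.
        return torch.zeros_like(output)

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            ablated = target_logit_fn(model(inputs))
    finally:
        handle.remove()

    # A large drop suggests the component is causally involved in the behavior.
    return baseline - ablated
```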
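
Sketch for Step 10 (sparse autoencoder). A minimal, illustrative architecture: an overcomplete ReLU encoder over activations, a linear decoder, and an L1 penalty that drives most feature activations to zero. Class and function names are my own.

```python
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into d_features sparse features."""

    def __init__(self, d_model, d_features):
        super().__init__()
        # Overcomplete: d_features is typically several times larger than d_model.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = F.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return F.mse_loss(recon, acts) + l1_coeff * features.abs().mean()
```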