DPO

Step 1 of 7

The RLHF Pipeline and Why It Hurt

By early 2023 the standard recipe for turning a pretrained language model into an assistant was a three-stage pipeline: supervised fine-tuning on demonstrations, training a separate reward model on human preference comparisons, and then reinforcement learning against that reward model with PPO. The pipeline — RLHF — produced InstructGPT, ChatGPT, Claude 1, and every other frontier assistant of the period. It also broke in ways that its users did not enjoy debugging.

DPO's argument is one sentence long: the reward-model step and the RL step are both unnecessary. A single classification-style loss on preference pairs optimizes the exact same objective that the reward-modeling and RL stages of RLHF approximate, without rollouts, without a reward network, and without PPO's hyperparameters. The paper's subtitle, "your language model is secretly a reward model," names the reparametrization that makes this work: the policy itself implicitly carries the reward-model signal.
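Concretely, the reparametrization expresses a reward through the unique policy that is optimal for it (the derivation is in §4 of the paper):

\[ r(x, y) \;=\; \beta \log \frac{\pi_r(y|x)}{\pi_{\text{ref}}(y|x)} \;+\; \beta \log Z(x) \]

Here \( \pi_r \) is the policy induced by the reward \( r \), \( \pi_{\text{ref}} \) is a frozen reference policy, \( \beta \) scales a KL constraint defined below, and \( Z(x) \) is an intractable partition function that cancels whenever two completions of the same prompt are compared, which is exactly what a preference loss does.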

What the RLHF pipeline actually runs

The paper's §3 lays out the two-phase objective that DPO replaces. Phase 1 fits a reward model \( r_\phi(x, y) \) by maximum likelihood under the Bradley–Terry model, given a dataset \( \mathcal{D} = \{(x, y_w, y_l)\} \) of prompts \( x \) with a preferred completion \( y_w \) and a dispreferred completion \( y_l \). Phase 2 optimizes the policy against that frozen reward:

\[ \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, \; y \sim \pi_\theta(\cdot|x)} \big[ r_\phi(x, y) \big] \; - \; \beta \, D_{\text{KL}} \big[ \pi_\theta(\cdot|x) \, \Vert \, \pi_{\text{ref}}(\cdot|x) \big] \]

Symbols: \( \pi_\theta \) is the trainable policy; \( \pi_{\text{ref}} \) is the frozen reference policy (typically the SFT checkpoint); \( r_\phi \) is the frozen reward model; \( \beta > 0 \) is the KL penalty coefficient controlling how far the policy may drift; \( y \sim \pi_\theta(\cdot|x) \) is a completion sampled from the policy. PPO is then applied to this reward landscape.
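Phase 1, for reference, is just as compact: the reward model is fit by minimizing the negative log-likelihood of the observed preferences under Bradley–Terry,

\[ \mathcal{L}_R(r_\phi; \mathcal{D}) \;=\; -\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \big[ \log \sigma \big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \big] \]

where \( \sigma \) is the logistic sigmoid. DPO keeps exactly this loss shape and changes only what fills the \( r_\phi \) slots.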

The KL term is not optional

Common misconception: the \( \beta \, D_{\text{KL}} \) term is a regularizer bolted on to stop the policy from wandering. It is the defining constraint of the objective: without it, the reward model's imperfect generalization lets the policy reward-hack to nonsense completions that score high under \( r_\phi \) but are gibberish to a human. The optimal policy and the DPO loss, both derived in §4 of the paper, depend on the exact form of this constrained objective: change the constraint and the DPO algebra breaks.
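The derivation in question starts from the closed-form optimum of the objective above, which holds for any reward function \( r \):

\[ \pi^*(y|x) \;=\; \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \, \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big), \qquad Z(x) \;=\; \sum_{y} \pi_{\text{ref}}(y|x) \, \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big) \]

Solving this for \( r \) in terms of \( \pi^* \) gives the reparametrization quoted in the introduction; the intractable \( Z(x) \) survives the inversion but cancels inside the Bradley–Terry comparison.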

What hurt about running this in practice

Three failure modes are cited across §1 and §6 of the paper. First, PPO requires on-policy sampling: generating \( y \sim \pi_\theta \) at every step, scoring each sample with \( r_\phi \), and back-propagating through a clipped surrogate loss. Memory has to hold the policy, the reward model, and (typically) a separate critic network simultaneously, often alongside a frozen reference model for the KL term. Second, PPO is famously sensitive to learning rate, clip ratio, value-loss coefficient, and rollout length; tuning a frontier-scale run is a major engineering effort. Third, the reward model overfits its preference training set and becomes a leaky proxy: as the policy climbs \( r_\phi \), downstream human evaluations often plateau or regress. That is the reward-hacking failure mode in action. DPO removes two of the three pipeline stages and, as §6 shows, keeps the alignment quality.
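For contrast, the entire replacement fits in a few lines. Below is a minimal PyTorch sketch of the DPO loss, assuming per-sequence log-probabilities have already been summed over tokens; the function and argument names are illustrative, and beta=0.1 is a common default rather than a prescription.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w: torch.Tensor,  # log π_θ(y_w|x), summed over tokens, shape [batch]
             policy_logps_l: torch.Tensor,  # log π_θ(y_l|x)
             ref_logps_w: torch.Tensor,     # log π_ref(y_w|x), from the frozen reference
             ref_logps_l: torch.Tensor,     # log π_ref(y_l|x)
             beta: float = 0.1) -> torch.Tensor:
    # β-scaled log-ratios against the frozen reference act as implicit rewards.
    chosen = policy_logps_w - ref_logps_w
    rejected = policy_logps_l - ref_logps_l
    # Classification-style loss on the pair: -log σ(β · (chosen - rejected)).
    # No rollouts, no reward network, no critic: two forward passes per pair.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

Because the reference model never updates, its log-probabilities can even be precomputed once over the dataset, shrinking the training-time memory footprint to the policy alone.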

RLHF pipeline vs DPO pipeline

[Figure: RLHF (three stages) vs DPO (one stage)]

RLHF: preferences D = { (x, y_w, y_l) } → Stage 1: fit reward model r_φ(x, y) → Stage 2: RL rollouts against r_φ (PPO + critic + KL-to-reference) → Stage 3: aligned policy π_θ

DPO: preferences D = { (x, y_w, y_l) } → classification loss on D (no reward model, no rollouts) → aligned policy π_θ

Both pipelines target the same KL-constrained reward objective; DPO reaches its optimum directly from the preferences.

RLHF's three-stage pipeline collapsed to DPO's single classification loss on the preference dataset. Structural redraw based on §1 of arXiv:2305.18290.