
Reinforcement Learning from Zero to RLHF

Construct a journey from basic decision processes to aligning complex models.

Estimated time: ~60 min

  1. Step 1
    A Markov decision process formalizes sequential decision-making with states, actions, transitions, rewards, and a discount factor. Its key assumption is that the next-state and reward distribution depends only on the current state and action, which makes Bellman-style planning possible.
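    A minimal sketch of the formalism as plain data; the states, actions, probabilities, and rewards below are invented purely for illustration:

    ```python
    # A toy MDP as plain data structures (all names and numbers are illustrative).
    states = ["cold", "warm"]
    actions = ["wait", "heat"]

    # transitions[(s, a)] -> list of (probability, next_state, reward)
    transitions = {
        ("cold", "wait"): [(1.0, "cold", 0.0)],
        ("cold", "heat"): [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
        ("warm", "wait"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
        ("warm", "heat"): [(1.0, "warm", 0.5)],
    }
    gamma = 0.95  # discount factor

    def expected_reward(s, a):
        """One-step expected reward; it depends only on (s, a), which is the Markov assumption."""
        return sum(p * r for p, _, r in transitions[(s, a)])
    ```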
  2. Step 2
    The Markov property says that the conditional distribution of the future depends only on the present state, not on the full past history, once the state is known. It is the defining assumption behind Markov chains and MDPs and tells you when a state representation is sufficient for planning.
  3. Step 3
    A value function estimates expected future return, either from a state or from a state-action pair under a policy. It matters because it turns delayed rewards into local training signals, enabling planning, bootstrapping, and lower-variance policy gradients.
  4. Step 4
    In reinforcement learning, a policy is the rule that maps states or observations to actions, often as a probability distribution. Learning a policy means directly improving behavior, and in language-model RL the policy is the model’s distribution over tokens or completions conditioned on context.
  5. Step 5
    Reinforcement learning studies how an agent should act through trial and error to maximize cumulative reward in an environment. Unlike in supervised learning, the feedback is delayed and depends on the agent's own actions, so the problem is as much about sequential decision-making as about prediction.
  6. Step 6
    A reward signal is the scalar feedback an RL agent receives about the desirability of its behavior. Because the agent optimizes whatever reward it is given, the design of the reward signal determines whether learning produces the intended behavior or merely exploits a proxy.
  7. Step 7
    Actor-critic methods learn a policy and a value estimator at the same time. The actor chooses actions, while the critic estimates return and supplies a lower-variance learning signal than raw returns alone.
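    A minimal sketch of a one-step advantage actor-critic update, assuming PyTorch and that `logp`, `value`, and `next_value` are scalar tensors produced elsewhere by actor and critic networks (the names here are hypothetical):

    ```python
    import torch
    import torch.nn.functional as F

    # One-step actor-critic losses (sketch): the critic's TD error serves as the
    # advantage signal for the actor, replacing the raw Monte Carlo return.
    def actor_critic_losses(logp, value, reward, next_value, gamma=0.99, done=False):
        mask = 0.0 if done else 1.0
        target = reward + gamma * mask * next_value.detach()   # bootstrapped critic target
        advantage = (target - value).detach()                  # lower-variance signal for the actor
        actor_loss = -logp * advantage
        critic_loss = F.mse_loss(value, target)
        return actor_loss, critic_loss
    ```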
  8. Step 8
    The credit assignment problem is the problem of determining which earlier actions, states, or internal computations deserve blame or credit for a later outcome. It is hard because rewards and losses are often delayed, sparse, or distributed across many interacting decisions.
  9. Step 9
    Dynamic programming solves an MDP with a known model by repeatedly applying Bellman updates until values or policies become self-consistent. Policy evaluation, policy improvement, policy iteration, and value iteration are the core algorithms in that family.
  10. Step 10
    Exploration versus exploitation is the trade-off between taking actions that seem best under current knowledge and taking actions that improve knowledge of the environment. In deep RL the problem is harder than in bandits because rewards can be sparse, state spaces are large, and short-term randomness is often not enough to discover long-horizon strategies.
  11. Step 11
    Monte Carlo reinforcement learning estimates values from complete sampled returns rather than from one-step bootstrapped targets. That makes the targets unbiased estimates of the episode return, but usually of higher variance than temporal-difference targets.
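    A minimal sketch of computing discounted returns \( G_t \) backward from one complete episode, assuming the episode's rewards have already been collected in a list:

    ```python
    # Discounted returns for every timestep of a finished episode, computed backward.
    def mc_returns(rewards, gamma=0.99):
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g          # G_t = r_t + gamma * G_{t+1}
            returns.append(g)
        return list(reversed(returns))
    ```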
  12. Step 12
    A multi-armed bandit is the simplest setting for exploration vs exploitation: \( K \) arms, each with an unknown reward distribution; pull one per step; minimize cumulative regret. The three canonical strategies, ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling, illustrate undirected random exploration, optimism in the face of uncertainty, and probability matching respectively, and they generalize to contextual bandits and full RL.
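    A minimal sketch of two of these strategies, assuming `values` holds the running mean reward per arm, `counts` the pull counts, and `t` the total number of pulls so far:

    ```python
    import math
    import random

    def epsilon_greedy(values, epsilon=0.1):
        # With probability epsilon explore uniformly, otherwise exploit the best arm so far.
        if random.random() < epsilon:
            return random.randrange(len(values))
        return max(range(len(values)), key=lambda a: values[a])

    def ucb1(values, counts, t):
        # Pull every arm once, then add a confidence bonus that shrinks as counts[a] grows.
        for a, n in enumerate(counts):
            if n == 0:
                return a
        return max(range(len(values)),
                   key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    ```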
  13. Step 13
    Policy iteration alternates between evaluating the current policy and improving it by acting greedily with respect to that value function. It often converges in fewer outer iterations than value iteration because each improvement step acts on a fully evaluated value function rather than a partially backed-up one.
  14. Step 14
    Q-learning is an off-policy reinforcement learning algorithm that learns the optimal action-value function by bootstrapping from a Bellman target over the best next action. Because its update does not require following the current policy, it became a foundational method in both tabular RL and DQN-style deep RL.
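    A minimal sketch of the tabular update, assuming `Q` is a dictionary (for example a `defaultdict(float)`) keyed by (state, action) pairs:

    ```python
    # One tabular Q-learning update. The target uses the max over next actions,
    # which is what makes the algorithm off-policy.
    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
        td_target = r + gamma * best_next
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    ```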
  15. Step 15
    REINFORCE is the basic Monte Carlo policy-gradient algorithm that updates parameters by weighting the log-probability of sampled actions by their returns. It is unbiased, but its variance is high, which is why practical methods usually add baselines or critics.
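    A minimal sketch of the REINFORCE loss, assuming PyTorch and that `log_probs` (log-probabilities of the sampled actions) and `returns` (the discounted returns \( G_t \)) were collected from one episode:

    ```python
    import torch

    # REINFORCE loss for one episode: weight each log-probability by its return.
    def reinforce_loss(log_probs, returns, baseline=None):
        returns = returns.detach()                 # no gradient flows through the returns
        if baseline is not None:
            returns = returns - baseline           # subtracting a baseline reduces variance
        return -(log_probs * returns).sum()        # minimize the negative of E[log pi * G]
    ```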
  16. Step 16
    Temporal-difference learning updates value estimates using the Bellman bootstrapped target \( R_t + \gamma V(S_{t+1}) \) rather than a full Monte Carlo return. TD(0) is the one-step instance; SARSA extends it on-policy to action-values; TD(λ) interpolates between TD(0) and Monte Carlo via eligibility traces. TD is the central learning rule behind Q-learning, DQN, and actor-critic.
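    A minimal sketch of the TD(0) update for state values, assuming `V` is a dictionary from states to floats:

    ```python
    # One TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s').
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])
    ```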
  17. Step 17
    The Bellman equation recursively expresses the value of a state or state-action pair as immediate reward plus discounted expected future value. It is the backbone of dynamic programming and reinforcement learning because it turns long-horizon return into a local consistency condition.
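    Written out for a fixed policy \( \pi \) and for the optimal value function, with \( P \) the transition model and \( r \) the reward:

    \[
    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma V^{\pi}(s') \bigr],
    \qquad
    V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma V^{*}(s') \bigr].
    \]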
  18. Step 18
    Value iteration solves a known Markov decision process by repeatedly applying the Bellman optimality backup until the value function converges. Once the optimal value is approximated, a greedy policy with respect to that value is optimal or near-optimal.
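    A minimal sketch using the toy transition format from Step 1, where `transitions[(s, a)]` lists (probability, next_state, reward) triples:

    ```python
    # Value iteration: sweep all states, applying the Bellman optimality backup
    # until the largest change in any state's value falls below a tolerance.
    def value_iteration(states, actions, transitions, gamma=0.95, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
                            for a in actions]
                best = max(q_values)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V
    ```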
  19. Step 19
    Friston's free-energy principle frames perception, learning, and action as minimizing a single quantity: variational free energy, an upper bound on surprise that decomposes into a KL complexity term minus an accuracy term. Active inference extends the principle to behavior: agents select actions that minimize expected free energy, simultaneously seeking preferred outcomes and information. It is a generative-model-based alternative to classical RL with tight links to variational inference, the ELBO, and exploration–exploitation.
  20. Step 20
    Decision Transformers cast offline reinforcement learning as conditional sequence modeling: a Transformer predicts the next action from past returns-to-go, states, and actions. This avoids explicit Bellman backups and instead treats policy learning like autoregressive imitation conditioned on the desired future return.
  21. Step 21
    DQN parameterizes \( Q(s, a; \theta) \) with a CNN and trains it by Q-learning on mini-batches sampled from a replay buffer, with a slowly updated target network stabilizing the bootstrapped target. Mnih et al. (2015) showed a single DQN architecture, trained from raw pixels, reaching human-level performance on many of 49 Atari games, the empirical breakthrough that ignited modern deep RL.
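    A minimal sketch of one DQN loss computation, assuming PyTorch, a `q_net` and frozen `target_net` that map state batches to per-action values, and a `batch` of tensors already sampled from the replay buffer:

    ```python
    import torch
    import torch.nn.functional as F

    # One DQN training loss: bootstrapped targets come from a frozen target network
    # and receive no gradient, which is what stabilizes the update.
    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
        with torch.no_grad():
            target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
        return F.smooth_l1_loss(q_sa, target)
    ```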
  22. Step 22
    Generalized Advantage Estimation is a family of advantage estimators that interpolates between low-variance temporal-difference updates and high-variance Monte Carlo returns using a parameter lambda. It is widely used because it gives a practical bias-variance tradeoff for policy-gradient training.
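    A minimal sketch of the estimator, assuming `rewards` is a list of per-step rewards and `values` holds the critic's estimates with one extra entry for the state after the final step (0 if terminal); `lam=0` recovers one-step TD errors, `lam=1` recovers Monte Carlo advantages:

    ```python
    # Generalized Advantage Estimation: an exponentially weighted sum of TD errors.
    def gae(rewards, values, gamma=0.99, lam=0.95):
        advantages, running = [], 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
            running = delta + gamma * lam * running
            advantages.append(running)
        return list(reversed(advantages))
    ```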
  23. Step 23
    GRPO is a policy-optimization method that scores sampled responses relative to others in the same group, using those relative rewards to update the policy. Its appeal is that it can improve reasoning performance while avoiding some of the memory overhead of PPO-style critic training.
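    A minimal sketch of the group-relative scoring step; exact normalization details vary between implementations, so treat this as illustrative:

    ```python
    # Score each sampled response to the same prompt relative to its group:
    # subtract the group's mean reward and divide by its standard deviation.
    def group_relative_advantages(rewards, eps=1e-6):
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        return [(r - mean) / (std + eps) for r in rewards]
    ```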
  24. Step 24
    Hierarchical reinforcement learning decomposes control across time scales, usually by letting a high-level policy choose skills, options, or subgoals and a low-level policy execute them. This can make sparse-reward and long-horizon problems easier, but only if the learned hierarchy discovers reusable abstractions rather than collapsing back to flat control.
  25. Step 25
    The KL-divergence penalty in RLHF keeps the learned policy close to a reference model while it maximizes reward, usually by subtracting a term proportional to the KL divergence from the objective. This stabilizes training and reduces reward hacking by discouraging the policy from drifting too far from fluent supervised behavior.
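    A minimal sketch of one common way to apply the penalty, where the per-token KL is approximated by the log-probability gap between the policy and the reference model and the scalar reward is credited to the final token; the names here are illustrative:

    ```python
    # Shape per-token rewards with a KL penalty toward a frozen reference model.
    def kl_shaped_rewards(reward, logprobs_policy, logprobs_ref, beta=0.1):
        shaped = [-beta * (lp - lr) for lp, lr in zip(logprobs_policy, logprobs_ref)]
        shaped[-1] += reward        # sequence-level reward added at the last token
        return shaped
    ```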
  26. Step 26
    KTO is a preference-optimization objective that learns from binary desirable-versus-undesirable labels instead of pairwise rankings. It uses a utility formulation inspired by prospect theory, making it a cheaper alternative when collecting full preference comparisons is too expensive.
  27. Step 27
    Model-based reinforcement learning learns an explicit or latent model of environment dynamics and uses that model for planning, imagination rollouts, or policy optimization. Its main advantage is sample efficiency, while its main failure mode is model bias: the policy can exploit errors in the learned simulator unless planning and training control compounding prediction error.
  28. Step 28
    Monte Carlo Tree Search for LLM reasoning treats partial solution paths as tree nodes, expands candidate continuations, and uses rollouts or value estimates to decide where to search next. It is attractive because it turns one-shot generation into guided search over reasoning trajectories instead of committing immediately to a single chain of thought.
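    A minimal sketch of the UCT-style selection score used to decide which child node to visit next; the constant `c` trades exploration against exploitation:

    ```python
    import math

    # UCT score for a child node: its mean value plus a bonus for being rarely visited.
    def uct_score(value_sum, visits, parent_visits, c=1.4):
        if visits == 0:
            return float("inf")      # always expand unvisited children first
        return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
    ```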
  29. Step 29
    Multi-agent reinforcement learning studies environments where several learning agents interact simultaneously, so the effective environment each agent faces shifts as the other agents' policies evolve. The main challenges are non-stationarity, coordination, and credit assignment, which is why centralized training with decentralized execution is a common modern design.
  30. Step 30
    Offline reinforcement learning learns a policy from a fixed logged dataset without further interaction with the environment. Its central difficulty is distribution shift: Bellman backups evaluate actions that are poorly supported by the data, so modern methods either constrain the learned policy to stay near the behavior data or pessimistically down-value unsupported actions.
  31. Step 31
    For a stochastic policy \( \pi_\theta(a \mid s) \), the gradient of expected return \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] \) is \( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(A \mid S) \cdot Q^{\pi_\theta}(S, A)] \). No gradient flows through the environment dynamics — the theorem turns RL into a stochastic optimisation over policy parameters. It is the foundation of REINFORCE, actor-critic, PPO, TRPO, and GRPO.
  32. Step 32
    Outcome reward models score only the final answer, while process reward models score the intermediate steps of a solution. PRMs provide denser supervision and better guidance for search and long-form reasoning, but they require more fine-grained labels and more complex evaluation.
  33. Step 33
    Process supervision trains a model on the quality of intermediate reasoning steps rather than only on whether the final answer is correct. It improves credit assignment for long solutions and makes verification more local, though collecting reliable step-level labels is expensive.
  34. Step 34
    Proximal Policy Optimization is a policy-gradient algorithm that improves a policy while clipping how far action probabilities can move from the previous policy in one update. In RLHF it is usually paired with a KL penalty so the model gains reward without drifting too far from a reference model.
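    A minimal sketch of the clipped surrogate loss, assuming PyTorch and that `logp_new`, `logp_old`, and `advantages` are tensors over a batch of sampled actions:

    ```python
    import torch

    # PPO clipped surrogate objective: cap how far the probability ratio can move
    # the update in a single step, then take the pessimistic (min) branch.
    def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        ratio = torch.exp(logp_new - logp_old)                      # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()                # minimize the negative surrogate
    ```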
  35. Step 35
    Reasoning models in the o1 or R1 style are language models trained or prompted to spend extra inference compute on long multi-step reasoning before answering. Their key idea is that better reasoning can come not only from bigger models, but from better search, verification, and credit assignment at inference and post-training time.
  36. Step 36
    Reward hacking occurs when an agent maximizes the formal reward signal while failing at the designer's intended objective. It is a general Goodhart-style failure mode in reinforcement learning: stronger optimization pressure often finds loopholes in the proxy reward faster than humans can patch them.
  37. Step 37
    RLAIF replaces human preference labels with judgments produced by another AI model following a rubric. It scales alignment data collection much more cheaply than RLHF, but it also transfers the biases and blind spots of the judge model into the training signal.
  38. Step 38
    Safe reinforcement learning studies how to optimize long-term return while satisfying safety constraints during training and deployment. The standard formalism is a constrained Markov decision process, where the policy must maximize reward subject to a bound on expected cost, risk, or unsafe-state visitation.
  39. Step 39
    TRPO performs policy updates by solving a constrained optimisation: maximise a surrogate advantage subject to \( \mathbb{E}[D_{\text{KL}}(\pi_\text{old} \| \pi_\theta)] \le \delta \). The KL trust region gives a monotonic-improvement guarantee and prevents collapse under function approximation. TRPO is solved by natural-gradient + line search; PPO is its first-order clipped approximation with near-identical performance at much lower cost.
  40. Step 40
    A world model is a learned predictive model of environment dynamics: given a state and an action, it predicts the next state (and often the reward). World models enable planning, offline rollouts, and sample-efficient RL. Recent systems (Dreamer, MuZero, Genie) show that powerful latent world models scale to games, robotics, and interactive video generation.