
RL, Causality & History Anchors

Anchor reinforcement learning, causal inference, and two landmark papers that shaped modern deep learning and transformers.

Estimated time: ~90 min

  1. Step 1: Markov decision processes
    A Markov decision process formalizes sequential decision-making with states, actions, transitions, rewards, and a discount factor. Its key assumption is that the next-state and reward distribution depends only on the current state and action, which makes Bellman-style planning possible.
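The MDP ingredients above can be sketched as plain Python data. This is a toy two-state example with illustrative names (`stay`, `move`), not any library's API; the point is that a single Bellman backup needs only the current state and action, which is the Markov assumption in action.

```python
GAMMA = 0.9  # discount factor

# transitions[s][a] = list of (next_state, probability, reward) -- a made-up 2-state MDP.
transitions = {
    "s0": {
        "stay": [("s0", 1.0, 0.0)],
        "move": [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],
    },
    "s1": {
        "stay": [("s1", 1.0, 2.0)],
        "move": [("s0", 1.0, 0.0)],
    },
}

def expected_value(state, action, V):
    """One Bellman backup: E[r + gamma * V(s')] computed from (state, action) alone."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in transitions[state][action])

V = {"s0": 0.0, "s1": 0.0}
backup = expected_value("s0", "move", V)  # 0.8 * (1.0 + 0) + 0.2 * (0.0 + 0) = 0.8
```

Because the transition entries condition only on `(state, action)`, no history needs to be stored, which is what makes the dynamic-programming methods in the next step tractable.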
  2. Step 2: Dynamic programming for MDPs
    Dynamic programming solves an MDP with a known model by repeatedly applying Bellman updates until values or policies become self-consistent. Policy evaluation, policy improvement, policy iteration, and value iteration are the core algorithms in that family.
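One member of that family, value iteration, can be sketched in a few lines. The MDP here is a hypothetical two-state toy with a known model; the loop applies the Bellman optimality update until values stop changing (self-consistency).

```python
GAMMA = 0.9
# transitions[s][a] = list of (next_state, probability, reward) -- a toy known-model MDP.
transitions = {
    "s0": {"stay": [("s0", 1.0, 0.0)], "move": [("s1", 1.0, 1.0)]},
    "s1": {"stay": [("s1", 1.0, 2.0)], "move": [("s0", 1.0, 0.0)]},
}

def value_iteration(transitions, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            # Bellman optimality update: best expected return over all actions.
            best = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for outcomes in transitions[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # values are self-consistent; stop
            return V

V = value_iteration(transitions)
# Staying in s1 forever earns 2 per step, so V[s1] = 2 / (1 - 0.9) = 20,
# and V[s0] = 1 + 0.9 * V[s1] = 19.
```

Policy iteration alternates full policy evaluation with greedy improvement instead; value iteration folds both into the single `max` inside the loop.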
  3. Step 3: The potential outcomes framework
    The potential outcomes framework defines causal effects by comparing the outcomes a unit would have under different treatments. Because only one of those potential outcomes is observed for any given unit, causal inference is fundamentally about identifying missing counterfactuals under defensible assumptions.
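A small synthetic simulation makes the "missing counterfactual" point concrete. Here both potential outcomes are generated (which is only possible in simulation), the observed data reveals exactly one per unit, and randomized assignment makes the difference in means an unbiased estimate of the average treatment effect (ATE). All numbers are made up for illustration.

```python
import random

random.seed(0)
n = 10_000
# True potential outcomes: Y(1) = Y(0) + 2, so the true ATE is 2.0 by construction.
y0 = [random.gauss(0.0, 1.0) for _ in range(n)]
y1 = [y + 2.0 for y in y0]

# Randomized assignment: treatment is independent of the potential outcomes.
treat = [random.random() < 0.5 for _ in range(n)]

# The fundamental problem of causal inference: only ONE outcome is observed per unit.
observed = [y1[i] if treat[i] else y0[i] for i in range(n)]

treated = [observed[i] for i in range(n) if treat[i]]
control = [observed[i] for i in range(n) if not treat[i]]
ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
# Under randomization, ate_hat estimates the true ATE of 2.0.
```

Without randomization (for example, if `treat` depended on `y0`), the same difference in means would be biased, which is where the "defensible assumptions" come in.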
  4. Step 4: Confounders, colliders, and Simpson's paradox
    Confounders create misleading associations because they affect both treatment and outcome, while colliders create bias when you condition on them. Simpson’s paradox is the visible symptom that aggregate and stratified associations can reverse direction when the underlying causal structure is ignored.
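Simpson's paradox can be demonstrated with the often-cited kidney-stone counts (treatment A beats B within every stratum, yet loses after pooling); the code below just tabulates success rates, with the stratum acting as the confounder.

```python
# data[stratum][treatment] = (successes, total); counts from the classic kidney-stone example.
data = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(stratum, treatment):
    succ, tot = data[stratum][treatment]
    return succ / tot

# Within each stratum, A has the higher success rate...
for stratum in data:
    assert rate(stratum, "A") > rate(stratum, "B")

def pooled(treatment):
    succ = sum(data[s][treatment][0] for s in data)
    tot = sum(data[s][treatment][1] for s in data)
    return succ / tot

# ...but pooling reverses the comparison, because stone size (the confounder)
# influences both which treatment is chosen and how likely success is.
reversal = pooled("B") > pooled("A")
```

Conditioning on the confounder (stratifying) recovers the right comparison here; conditioning on a collider would do the opposite and *introduce* bias, which is why the causal structure, not a mechanical rule, decides what to adjust for.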
  5. Step 5: "Attention Is All You Need"
    “Attention Is All You Need” introduced the Transformer: a sequence model built around self-attention instead of recurrence or convolution. The paper mattered because it showed that attention-based, highly parallel sequence modeling could outperform recurrent seq2seq systems and set the template for modern LLMs.
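The core mechanism can be sketched in NumPy. This is a deliberately stripped-down, single-head version that reuses the input as queries, keys, and values (a real Transformer learns separate projection matrices); it shows why the computation parallelizes, since every position attends to every other position in one matrix product rather than a step-by-step recurrence.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over X of shape (seq_len, d).

    Simplification for illustration: X serves as queries, keys, AND values.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # all pairwise similarities at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ X, weights                      # each output row mixes all positions

X = np.random.default_rng(0).standard_normal((4, 8))
out, weights = self_attention(X)
# out has the same shape as X; each row of `weights` sums to 1.
```

The `1/sqrt(d)` scaling and the softmax over key positions match the paper's scaled dot-product formulation; everything omitted (projections, multiple heads, masking, positional encodings) is what turns this sketch into the full Transformer.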
  6. Step 6: AlexNet and ILSVRC 2012
    AlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe—ImageNet-scale data, GPU training, ReLU, dropout, and augmentation—not from a single isolated trick.