Experimental Method & RL Fundamentals
Connect reliable experimentation to the dynamic-programming and return-estimation ideas at the core of reinforcement learning.
- Step 1: Missing-data methods try to preserve valid inference when some values are unobserved by modeling why data are missing and how to fill in or integrate over the missing entries. The key distinction is between missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), because imputation is far safer when the missingness can be treated as conditionally ignorable. A minimal imputation sketch follows this list.
- Step 2: Data leakage is any contamination that lets training or validation use information that would not be available at prediction time. Target leakage is the specific case where features encode the label or a post-outcome proxy for it, so every target leakage problem is data leakage, but not every data leakage problem is target leakage. A common pipeline example appears after this list.
- Step 3: An ablation study removes or alters one component of a system to measure how much that component actually contributes. Experimental control matters because an ablation is only informative when the comparison keeps everything else fixed, including data, tuning budget, and evaluation protocol; see the ablation-loop sketch after this list.
- Step 4: Value iteration solves a known Markov decision process by repeatedly applying the Bellman optimality backup until the value function converges. Once the optimal value function is well approximated, the policy that acts greedily with respect to it is optimal or near-optimal; the backup is implemented in the sketch after this list.
- Step 5: Policy iteration alternates between evaluating the current policy and improving it by acting greedily with respect to that value function. It often converges in fewer outer iterations than value iteration because each improvement step starts from a fully solved evaluation subproblem; a sketch follows this list.
- Step 6: Monte Carlo reinforcement learning estimates values from complete sampled returns rather than from one-step bootstrapped targets. That makes the targets unbiased estimates of the expected return, but usually higher variance than temporal-difference targets; a first-visit estimator is sketched after this list.
- Step 7: The credit assignment problem is that of determining which earlier actions, states, or internal computations deserve blame or credit for a later outcome. It is hard because rewards and losses are often delayed, sparse, or distributed across many interacting decisions.
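
For Step 1, a minimal conditional-mean imputation sketch: the column with gaps is regressed on a fully observed column, and missing entries are filled with their predicted values. This is only defensible when the missingness is ignorable given the observed column (the MAR case); all data and numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x is fully observed, y loses ~30% of its entries (NaN).
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
y[rng.random(200) < 0.3] = np.nan

# Conditional-mean imputation: fit y ~ x on the observed rows, then fill
# each missing y with its prediction. Valid only if missingness in y is
# ignorable given x (MAR); under MNAR this estimate is biased.
obs = ~np.isnan(y)
slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
y_filled = np.where(obs, y, slope * x + intercept)

print(f"filled {int(np.sum(~obs))} missing values")
```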
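For Step 2, a sketch of one common leakage pattern, feature scaling fit before the train/test split; the fix computes the statistics on the training rows only. The array shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Leaky: normalization statistics use ALL rows, so the training features
# already encode information about the held-out rows.
mu_all, sd_all = X.mean(axis=0), X.std(axis=0)
X_leaky = (X - mu_all) / sd_all

# Correct: split first, fit on the training rows, freeze the statistics,
# and only then transform the held-out rows.
train, test = X[:80], X[80:]
mu, sd = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mu) / sd
test_scaled = (test - mu) / sd
```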
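For Step 3, an ablation-loop sketch on a made-up regression task: the only thing that changes between the two runs is the ablated component (an interaction feature), while the seed, data, split, and metric all stay fixed.

```python
import numpy as np

def evaluate(use_interaction: bool, seed: int = 0) -> float:
    """Train and evaluate a tiny least-squares model; only the ablated
    component (the interaction feature) varies between runs."""
    rng = np.random.default_rng(seed)             # fixed seed: same data each run
    x1, x2 = rng.normal(size=(2, 300))
    y = x1 + 0.5 * x1 * x2 + rng.normal(scale=0.1, size=300)

    cols = [np.ones_like(x1), x1, x2]
    if use_interaction:                           # the single toggled component
        cols.append(x1 * x2)
    X = np.stack(cols, axis=1)

    train, test = slice(0, 200), slice(200, 300)  # fixed split: same protocol
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return float(np.mean((X[test] @ w - y[test]) ** 2))

for flag in (True, False):
    print(f"interaction={flag}: test MSE = {evaluate(flag):.4f}")
```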
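For Step 4, a direct implementation of the Bellman optimality backup V(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')] on a two-state MDP whose transition and reward numbers are made up purely for illustration.

```python
import numpy as np

# Tiny MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.0, 1.0], [0.7, 0.3]],   # transitions from state 1
])
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])
gamma = 0.95

# Value iteration: repeat the Bellman optimality backup until V stops moving.
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * P @ V          # Q[s, a] = R(s, a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
print("V:", V, "policy:", policy)
```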
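For Step 5, policy iteration on the same toy MDP: each outer iteration solves the policy-evaluation equations exactly with a linear solve, then improves the policy greedily, and stops when the policy is stable.

```python
import numpy as np

# Same illustrative MDP as the value-iteration sketch:
# P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.7, 0.3]],
])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
gamma, n_states = 0.95, 2

policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]     # P_pi[s, s'] under the policy
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily with respect to V.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):    # stable policy: done
        break
    policy = new_policy

print("optimal policy:", policy, "values:", V)
```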
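For Step 6, a first-visit Monte Carlo estimator on an invented random walk: each state's value is the average of the complete discounted returns observed after its first visit in each episode, with no bootstrapping from the current value estimates.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
gamma = 0.9

def run_episode():
    """Random walk on states 0..3; entering state 3 ends the episode with reward 1."""
    s, traj = 0, []
    while s != 3:
        s_next = s + rng.choice([-1, 1]) if s > 0 else s + 1
        traj.append((s, 1.0 if s_next == 3 else 0.0))
        s = s_next
    return traj

# First-visit Monte Carlo: average the full discounted return observed
# after the first visit to each state.
returns = defaultdict(list)
for _ in range(2000):
    traj = run_episode()
    G, rets = 0.0, []
    for s, r in reversed(traj):       # accumulate returns from the end
        G = r + gamma * G
        rets.append((s, G))
    seen = set()
    for s, G in reversed(rets):       # forward order again, keep first visits
        if s not in seen:
            seen.add(s)
            returns[s].append(G)

V = {s: float(np.mean(g)) for s, g in sorted(returns.items())}
print(V)
```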