
Generalization & Learning Theory

Rare but high-value theoretical foundations. Move from PAC learning and VC dimension to modern insights like double descent and neural collapse.

Estimated time: ~120 min

  1. Step 1
    PAC (probably approximately correct) learning formalizes what it means for a hypothesis class to be learnable: with enough samples, an algorithm should return a hypothesis whose error is small with high probability. It is foundational because sample complexity and model capacity can then be expressed as rigorous guarantees instead of heuristics; a sample-complexity sketch for the finite-class case follows this list.
  2. Step 2
    The Vapnik–Chervonenkis (VC) dimension of a hypothesis class \( \mathcal{H} \) is the size of the largest set of points that \( \mathcal{H} \) can shatter, i.e. label in every possible way. It controls PAC generalization: if \( \text{VC}(\mathcal{H}) = d \), then with probability \( 1 - \delta \) every \( h \in \mathcal{H} \) satisfies \( L(h) \le \hat L(h) + O\big(\sqrt{(d \log(n/d) + \log(1/\delta))/n}\big) \). VC dimension explains why linear classifiers in \( \mathbb{R}^d \) need \( \Omega(d) \) examples and why simple hypothesis classes generalize; a shattering check in code appears after this list.
  3. Step 3
    The empirical Rademacher complexity of a function class \( \mathcal{F} \) on data \( S = (z_1, \dots, z_n) \) is \( \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\big[\sup_{f \in \mathcal{F}} \tfrac{1}{n}\sum_i \sigma_i f(z_i)\big] \), the expected ability of \( \mathcal{F} \) to correlate with random \( \pm 1 \) signs \( \sigma_i \). It is the data-dependent workhorse of modern generalization bounds, usually tighter than VC-based ones, and it yields direct norm-based bounds for deep networks; a Monte Carlo estimate is sketched after this list.
  4. Step 4
    PAC-Bayes bounds the generalization gap of a stochastic classifier \( Q \) by a Kullback–Leibler term against a data-independent prior \( P \): with probability \( 1 - \delta \), \( \mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat R(h)] + O\big(\sqrt{(\text{KL}(Q\|P) + \ln(n/\delta))/n}\big) \). This framework has produced non-vacuous generalization bounds for overparameterized networks; a numerical evaluation of the bound follows this list.
  5. Step 5
    Double descent is the phenomenon in which test error first follows the classical U-shape as model size increases, then improves again once the model passes the interpolation threshold, the point at which it can fit the training data exactly. It matters because it shows that the classical bias-variance story is incomplete in highly overparameterized regimes; a random-features simulation appears after this list.
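
For Step 1, a minimal sketch of the realizable, finite-hypothesis-class case of PAC learning: a consistent learner needs roughly \( n \ge \tfrac{1}{\epsilon}(\ln|\mathcal{H}| + \ln(1/\delta)) \) samples to reach error \( \epsilon \) with probability \( 1 - \delta \). The function name and the numbers in the example are illustrative, not from any particular library.

```python
import math

def pac_sample_complexity(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for a consistent learner over a finite hypothesis class
    to have true error <= epsilon with probability >= 1 - delta:
    n >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Example: one million hypotheses, 1% target error, 95% confidence.
print(pac_sample_complexity(10**6, epsilon=0.01, delta=0.05))  # 1682
```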
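
For Step 2, a brute-force shattering check under the usual definition: halfspaces in \( \mathbb{R}^2 \) shatter three points in general position but cannot realize the XOR labeling of four points, so their VC dimension is 3. The helper names are ad hoc; separability is tested with a small feasibility LP via scipy.optimize.linprog.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points: np.ndarray, labels: np.ndarray) -> bool:
    """Feasibility LP: does some (w, b) satisfy y_i * (w . x_i + b) >= 1 for all i?"""
    n, d = points.shape
    # Variables are [w_1, ..., w_d, b]; each row encodes -y_i * (w . x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points: np.ndarray) -> bool:
    """True if halfspaces realize every +/-1 labeling of the given points."""
    n = len(points)
    return all(linearly_separable(points, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
print(shattered(three))  # True: all 8 labelings are linearly separable
print(shattered(four))   # False: the XOR labeling has no separating halfspace
```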
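
For Step 3, a Monte Carlo estimate of the empirical Rademacher complexity of norm-bounded linear functions \( \{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\} \). For this class the inner supremum has the closed form \( \tfrac{B}{n}\|\sum_i \sigma_i x_i\|_2 \), so only the expectation over sign vectors needs sampling. The data is synthetic and the function name is illustrative.

```python
import numpy as np

def rademacher_linear(X: np.ndarray, norm_bound: float, n_draws: int = 2000,
                      seed: int = 0) -> float:
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w.x : ||w||_2 <= B} on the sample X (shape n x d).

    Per sign vector sigma the supremum equals (B/n) * ||sum_i sigma_i x_i||_2,
    so we only average that quantity over random sigma draws."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))      # random +/-1 signs
    sups = norm_bound / n * np.linalg.norm(sigma @ X, axis=1)
    return float(sups.mean())

# Example: 200 points in R^10; compare with the classical bound B * max_i ||x_i|| / sqrt(n).
X = np.random.default_rng(1).normal(size=(200, 10))
est = rademacher_linear(X, norm_bound=1.0)
bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(200)
print(f"MC estimate: {est:.3f}   norm-based upper bound: {bound:.3f}")
```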
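
For Step 4, a numerical evaluation of a McAllester-style PAC-Bayes bound, together with the closed-form KL between diagonal Gaussians that is commonly used when \( Q \) is a Gaussian centered at the trained weights and \( P \) is a broad prior. The \( \ln(2\sqrt{n}/\delta) \) constant follows the McAllester form; the example numbers are made up.

```python
import math
import numpy as np

def pac_bayes_bound(emp_risk: float, kl: float, n: int, delta: float) -> float:
    """McAllester-style bound on the expected true risk of a stochastic classifier Q:
    E_Q[R] <= E_Q[R_hat] + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2n))."""
    return emp_risk + math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p) -> float:
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over coordinates."""
    mu_q, var_q, mu_p, var_p = map(np.asarray, (mu_q, var_q, mu_p, var_p))
    return 0.5 * float(np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

# Example: 50k samples, 2% stochastic training error, KL of 5000 nats -> bound around 0.24.
print(pac_bayes_bound(emp_risk=0.02, kl=5000.0, n=50_000, delta=0.05))
```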
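
For Step 5, a rough double-descent simulation using random ReLU features and minimum-norm least squares: test error typically rises and spikes as the feature count approaches the interpolation threshold at p = n_train, then descends again for much wider models. The hyperparameters (40 training points, noise level 0.1, the chosen widths) are arbitrary, and the exact numbers vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

def relu_features(x, W, b):
    """Random ReLU features: phi_j(x) = max(0, W_j * x + b_j), one column per feature."""
    return np.maximum(0.0, np.outer(x, W) + b)

def avg_test_mse(p, n_train=40, n_test=500, noise=0.1, trials=20):
    """Average test MSE of minimum-norm least squares on p random features.
    The interpolation threshold sits at p == n_train."""
    errs = []
    for _ in range(trials):
        x_tr = rng.uniform(-1, 1, n_train)
        y_tr = target(x_tr) + noise * rng.normal(size=n_train)
        x_te = rng.uniform(-1, 1, n_test)
        W, b = rng.normal(size=p), rng.uniform(-1, 1, p)
        # lstsq returns the minimum-norm solution once the system is underdetermined.
        coef, *_ = np.linalg.lstsq(relu_features(x_tr, W, b), y_tr, rcond=None)
        errs.append(np.mean((relu_features(x_te, W, b) @ coef - target(x_te)) ** 2))
    return np.mean(errs)

# Sweep the width: classical regime (p < 40), a spike near p = 40, second descent beyond.
for p in [5, 10, 20, 30, 38, 40, 42, 60, 100, 300, 1000]:
    print(f"p = {p:5d}   test MSE = {avg_test_mse(p):.3g}")
```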