ResNet
The Degradation Problem
By late 2014 the best image classifiers — VGG (19 layers) and GoogLeNet (22 layers) — had already pushed convolutional networks much deeper than the 8-layer AlexNet of 2012. A natural question was whether depth would keep paying off. The answer turned out to be yes, but only after an architectural fix. Stacking plain layers beyond roughly 20 made both training error and test error rise, not fall.
Degradation is not overfitting
Common misconception: deeper plain networks generalize worse because they overfit. If overfitting were the cause, training error would keep dropping while test error climbed. The paper's Figure 1 (left) shows the opposite: on CIFAR-10, a 56-layer plain convolutional network has strictly higher training error than a 20-layer one (roughly 9% vs. 7%). The optimizer cannot even fit the training set as well with more layers. The problem is optimization, not capacity.
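To make "plain" concrete, here is a minimal sketch of what these baselines look like: a straight stack of 3x3 conv + BN + ReLU layers with no shortcut connections. PyTorch is an assumption (the document names no framework), and the width and classifier head are illustrative, not the paper's exact CIFAR-10 configuration; only the layer count is the point.

```python
import torch.nn as nn

def plain_net(n_layers: int, width: int = 16, n_classes: int = 10) -> nn.Sequential:
    """Plain stack of 3x3 conv + BN + ReLU blocks, no shortcuts.
    Illustrative stand-in for the paper's plain CIFAR-10 baselines."""
    layers = [nn.Conv2d(3, width, 3, padding=1, bias=False),
              nn.BatchNorm2d(width), nn.ReLU()]
    for _ in range(n_layers - 2):  # middle layers, all the same shape
        layers += [nn.Conv2d(width, width, 3, padding=1, bias=False),
                   nn.BatchNorm2d(width), nn.ReLU()]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(width, n_classes)]  # global pool + linear classifier
    return nn.Sequential(*layers)

plain20, plain56 = plain_net(20), plain_net(56)  # the two depths compared in Fig. 1
```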
Why this was surprising
A deeper network can represent everything a shallower one can: given a shallow net, a deeper version could match it exactly by setting the added layers to the identity mapping. This identity construction means nothing architectural prevents the deeper net from matching the shallow net's training loss. Yet first-order optimizers fail to find such a solution. Depth breaks the optimizer, not expressivity.
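The construction is easy to check numerically. The sketch below (PyTorch, with toy layer sizes chosen purely for illustration) appends a linear layer whose weight matrix is the identity to a small network and verifies that the deeper model reproduces the shallower one's outputs.

```python
import torch
import torch.nn as nn

# A small shallow network (toy sizes, purely illustrative).
shallow = nn.Sequential(
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
)

# Extra layer configured to compute the identity: weight = I, bias = 0.
extra = nn.Linear(32, 32)
with torch.no_grad():
    extra.weight.copy_(torch.eye(32))
    extra.bias.zero_()

deeper = nn.Sequential(shallow, extra)  # one layer deeper, same function

x = torch.randn(8, 32)
assert torch.allclose(shallow(x), deeper(x))  # the deeper net matches the shallow one
```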
What the paper is not claiming
The degradation problem is distinct from the vanishing-gradient problem. The paper argues that vanishing gradients are unlikely to be the cause (§4.1): the plain networks are trained with batch normalization and He initialization, forward signals have non-zero variances, and the authors verify that backward gradients exhibit healthy norms. Gradients are fine; the loss landscape the optimizer faces is simply hard to descend. Residual learning attacks that landscape directly.
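The reformulation is to let each stack of layers learn a residual F(x) and add the input back: y = F(x) + x. Below is a minimal sketch of such a block (PyTorch again as an assumption; the two-conv structure follows the paper's basic block). If the optimizer pushes F's weights toward zero, the block degenerates to the identity, so the "do nothing" solution from the construction above becomes trivially reachable.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), where F is two 3x3 conv layers.
    If F's weights shrink toward zero, the block approaches the identity map."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)  # identity shortcut added before the final ReLU
```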
Training error vs. depth for plain networks
Schematic reproduction of the paper's Fig. 1 (left). The 56-layer plain network sits strictly above the 20-layer plain network throughout training. Deeper is worse, even on the training set.