Learning Theory & Evaluation
Connect risk minimization, validation protocols, evaluation metrics, and deployment-shift thinking so that model selection and measurement line up with generalization.
- Step 1: Empirical risk minimization chooses the model with the smallest average training loss. It is the default principle behind most supervised learning, but it must be paired with capacity control or held-out evaluation because low training loss alone does not guarantee generalization (sketch 1 below).
- Step 2: Structural risk minimization extends empirical risk minimization by balancing training fit against model complexity. It is the learning-theoretic principle behind regularization, margin control, and choosing among hypothesis classes of different capacity (sketch 2 below).
- Step 3: A train/validation/test split separates fitting, model selection, and final evaluation into different datasets. The test set is kept untouched until the end so it remains a credible estimate of out-of-sample performance (sketch 3 below).
- Step 4: k-fold cross-validation rotates a held-out fold through the dataset so every example is used for validation exactly once and for training in the remaining rounds. It uses limited data efficiently for model selection, but it costs multiple training runs and must keep preprocessing inside each fold (sketch 4 below).
- Step 5: A confusion matrix counts predicted labels against true labels. In binary classification it yields the four basic counts (true positives, false positives, true negatives, and false negatives) from which most common thresholded metrics are derived (sketch 5 below).
- Step 6: A precision-recall curve shows how precision and recall trade off as the decision threshold moves through a ranked list of predictions. Average precision summarizes that curve and is especially informative when the positive class is rare (sketch 6 below).
- Step 7: Feature scaling rescales input dimensions to comparable magnitudes, while standardization specifically subtracts the training mean and divides by the training standard deviation. It matters because optimization, distances, and margins can otherwise be dominated by whichever feature is measured in the largest units (sketch 7 below).
- Step 8: Distribution shift occurs when the joint distribution seen at deployment differs from the one used for training or validation. The main cases are covariate shift, label shift, and concept shift; each breaks generalization in a different way and therefore requires different detection and mitigation strategies (sketch 8 below).
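Sketch 1, for Step 1: a minimal empirical risk minimizer over a toy hypothesis class of one-dimensional threshold rules under 0-1 loss. The hypothesis class, the synthetic data, and the loss are all illustrative assumptions; the card itself does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                      # 1-D inputs (synthetic)
y = (X > 0.3).astype(int)                     # true labeling rule (assumed)

# Hypothesis class: threshold classifiers h_t(x) = 1[x > t]
thresholds = np.linspace(-2, 2, 41)

def empirical_risk(t):
    preds = (X > t).astype(int)
    return np.mean(preds != y)                # average 0-1 training loss

risks = [empirical_risk(t) for t in thresholds]
best_t = thresholds[int(np.argmin(risks))]    # ERM: smallest training risk wins
print(f"ERM threshold: {best_t:.2f}, training risk: {min(risks):.3f}")
```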
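Sketch 2, for Step 2: structural risk minimization over nested polynomial classes, using a simple degree penalty as the complexity term. The penalty form and the weight `lam` are illustrative stand-ins for a proper capacity measure, not something the card prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.3 * rng.normal(size=40)   # noisy synthetic target

# Nested hypothesis classes: polynomials of increasing degree.
# SRM selects the class minimizing training fit plus a complexity penalty.
lam = 0.02                                      # penalty weight (assumed)
scores = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)   # empirical risk
    scores[degree] = mse + lam * (degree + 1)         # penalized (structural) risk

best = min(scores, key=scores.get)
print("SRM-selected degree:", best)
```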
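Sketch 3, for Step 3: a train/validation/test protocol with scikit-learn. The candidate values of `C` and the split ratios are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off the test set; it stays untouched until the very end.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into fitting (train) and model-selection (validation) sets.
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:              # candidate models
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)           # selection uses validation only
    if acc > best_acc:
        best_C, best_acc = C, acc

final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))   # reported exactly once
```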
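Sketch 4, for Step 4: 5-fold cross-validation with preprocessing kept inside each fold via a scikit-learn pipeline, so the scaler is re-fit on every training split and never sees the validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The pipeline keeps preprocessing inside each fold: scaler statistics are
# recomputed on each training split, so validation data never leaks in.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)   # 5 training runs, one per fold
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```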
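Sketch 5, for Step 5: extracting the four basic counts from a binary confusion matrix and deriving precision and recall from them. The tiny label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true labels, columns are predicted labels (sklearn convention).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")

# Most common thresholded metrics derive from these four counts:
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```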
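Sketch 6, for Step 6: a precision-recall curve and its average-precision summary on synthetic scores with a rare (roughly 5%) positive class. The score-generation model is an assumption made for the demo.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)        # rare positive class
# Noisy scores that are higher, on average, for true positives.
scores = y_true * 1.0 + rng.normal(scale=0.8, size=1000)

# Sweep the decision threshold through the ranked predictions.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)          # summarizes the curve
print(f"average precision: {ap:.3f} over {len(thresholds)} thresholds")
```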
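Sketch 7, for Step 7: standardization that learns the mean and standard deviation from the training data only and reuses those statistics on the test data, as the card requires. The feature scales are fabricated to make the units mismatch obvious.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales (think metres vs. millimetres).
X_train = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
X_test = np.column_stack([rng.normal(0, 1, 50), rng.normal(0, 1000, 50)])

scaler = StandardScaler().fit(X_train)        # mean/std from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)         # test reuses training statistics

print("training means after scaling:", X_train_std.mean(axis=0).round(3))
print("training stds  after scaling:", X_train_std.std(axis=0).round(3))
```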
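Sketch 8, for Step 8: one common detection heuristic for covariate shift, a domain classifier trained to distinguish training-time inputs from deployment-time inputs. The card does not name this technique, so treat it as one option among several; the drifted data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train_time = rng.normal(0.0, 1.0, size=(500, 3))    # features at training time
X_deploy = rng.normal(0.8, 1.0, size=(500, 3))        # covariates have drifted

# Domain-classifier test: label each sample by its origin and check whether a
# classifier can tell the two pools apart. AUC near 0.5 suggests no detectable
# shift; AUC well above 0.5 suggests the input distribution has moved.
X = np.vstack([X_train_time, X_deploy])
domain = np.array([0] * len(X_train_time) + [1] * len(X_deploy))

auc = cross_val_score(LogisticRegression(max_iter=1000), X, domain,
                      cv=5, scoring="roc_auc").mean()
print(f"domain-classifier AUC: {auc:.3f}")
```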