Tag: evaluation
11 topic(s)
- Calibration: Calibration measures whether predicted probabilities match observed frequencies, so events predicted at 70% should occur about 70% of the time. A model can be accurate but poorly calibrated if its confidence is systematically too high or too low. (sketched below)
- Temperature Scaling: Temperature scaling calibrates a classifier by dividing logits by a learned scalar temperature before the softmax. It often improves probability calibration on a validation set without changing the model’s ranking of classes. (sketched below)
- Hypothesis Testing, p-values, and Statistical Power: A hypothesis test compares a null \( H_0 \) to an alternative \( H_1 \) by computing a test statistic and its tail probability under \( H_0 \) (the p-value). Statistical power is \( 1 - \beta \), the probability of rejecting \( H_0 \) when \( H_1 \) is true. For ML evaluation, these are the tools that separate "this model is better" from "this split was lucky". (sketched below)
- Goodhart's Law and Specification Gaming: Goodhart's Law says that once a measure becomes a target, optimizing it can break its usefulness as a proxy. In AI safety and ML systems this appears as specification gaming: the model finds ways to maximize the metric without achieving the intended goal.
- Generative Model Evaluation (FID, IS, and their limits): Fréchet Inception Distance (FID) and Inception Score (IS) are the standard automated metrics for image generative models; both rely on Inception-v3 features and have well-known biases. Modern T2I evaluation supplements them with CLIPScore, prompt-adherence benchmarks (T2I-CompBench, GenEval), learned human-preference scores (ImageReward, HPS), Elo-style preference rankings, and likelihood / NLL where applicable. (sketched below)
- Multiple Hypothesis Testing and False Discovery Rate: Multiple hypothesis testing asks how to control false positives when many tests are run at once. False discovery rate control, especially the Benjamini–Hochberg procedure, limits the expected fraction of rejected hypotheses that are actually null and is usually less conservative than family-wise error control. (sketched below)
- Likelihood Ratio Tests: A likelihood ratio test compares how well two nested statistical models explain the same data by taking the ratio of their maximized likelihoods. Large likelihood-ratio statistics indicate that the larger model fits substantially better than the restricted one, and under regularity conditions the statistic is asymptotically chi-squared under \( H_0 \), with degrees of freedom equal to the difference in the number of free parameters. (sketched below)
- Bootstrap Confidence Intervals: Bootstrap confidence intervals estimate uncertainty by resampling the observed dataset with replacement and recomputing the statistic many times. They are useful when analytic standard errors are awkward, but they inherit the sample's biases and can fail when the original sample is too small or unrepresentative. (sketched below)
- Brier Score: The Brier score measures the mean squared error of probabilistic predictions, so it rewards both correctness and calibration. Lower is better, and unlike accuracy it penalizes a confidently wrong 0.99 prediction much more than a cautious 0.6 prediction. (sketched below)
- Target Leakage vs. Data Leakage: Data leakage is any contamination that lets training or validation use information that would not be available at prediction time. Target leakage is the specific case where features encode the label or a post-outcome proxy for it, so every target leakage problem is data leakage, but not every data leakage problem is target leakage.
- Ablation Studies and Experimental Control: An ablation study removes or alters one component of a system to measure how much that component actually contributes. Experimental control matters because an ablation is only informative when the comparison keeps everything else fixed, including data, tuning budget, and evaluation protocol.
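
The sketches below are illustrative only: Python with NumPy/SciPy is assumed because the page does not name a language, and all data, counts, and helper names (`expected_calibration_error`, `fit_temperature`, and so on) are invented for the examples. First, for the Calibration entry, a minimal binned calibration check for binary predictions: one common way to turn "events predicted at 70% should happen about 70% of the time" into a single number (a simple expected calibration error).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between average confidence and observed frequency per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if not mask.any():
            continue
        avg_conf = probs[mask].mean()    # what the model claimed
        obs_freq = labels[mask].mean()   # what actually happened
        ece += mask.mean() * abs(avg_conf - obs_freq)
    return ece

# Illustrative data: predictions that are systematically too confident.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0.5, 1.0, 5000)
y = rng.binomial(1, 0.5 + 0.5 * (p_pred - 0.5))   # true rate rises slower than confidence
print(f"ECE = {expected_calibration_error(p_pred, y):.3f}")
```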
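
For Temperature Scaling, a sketch under the same assumptions: a single scalar T is fit on held-out logits by minimizing negative log-likelihood, and dividing by a positive T cannot change which class has the largest logit.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true class after dividing logits by T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                        # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """One scalar T chosen on validation data; class ranking is unchanged because T > 0."""
    res = minimize_scalar(lambda T: nll(val_logits, val_labels, T),
                          bounds=(0.05, 10.0), method="bounded")
    return res.x

# Usage: apply softmax(test_logits / fit_temperature(val_logits, val_labels)) at test time.
```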
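
For the hypothesis-testing entry, a sketch of both halves: an exact McNemar-style p-value computed only on the examples where two models disagree, and the approximate power of a one-sided test for a given true accuracy gain. The counts and sample sizes are invented, and the power formula is the usual normal approximation.

```python
import numpy as np
from scipy.stats import binomtest, norm

# p-value: exact McNemar-style test on the disagreements between two models.
# b = cases where model A is right and B is wrong, c = the reverse (illustrative counts).
b, c = 48, 30
p_value = binomtest(b, b + c, p=0.5, alternative="two-sided").pvalue

def power_one_sided(p0, p1, n, alpha=0.05):
    """Power of a one-sided z-test of H0: accuracy = p0 when the true accuracy is p1."""
    thresh = p0 + norm.ppf(1 - alpha) * np.sqrt(p0 * (1 - p0) / n)
    return 1 - norm.cdf((thresh - p1) / np.sqrt(p1 * (1 - p1) / n))

print(f"p = {p_value:.4f}")
print(f"power at n=2000 for 0.80 -> 0.82: {power_one_sided(0.80, 0.82, 2000):.2f}")
```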
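
For the generative-model entry, a sketch of the Fréchet distance that FID computes, \( \|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}\big) \), between Gaussian fits to two feature sets. Real FID extracts the features with Inception-v3; that step is omitted here and any two (n, d) feature arrays will do.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits to two sets of feature vectors."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```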
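
A sketch of the Benjamini–Hochberg step-up rule mentioned in the FDR entry: sort the p-values, find the largest rank \( i \) with \( p_{(i)} \le q\,i/m \), and reject every hypothesis up to that rank.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses, controlling FDR at level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank i with p_(i) <= q*i/m
        reject[order[: k + 1]] = True      # reject everything up to that rank
    return reject
```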
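
A small worked likelihood ratio test under the same assumptions: the restricted model says two evaluation slices share one error rate, the full model gives each slice its own rate, and the statistic \( 2(\ell_{\text{full}} - \ell_{\text{restricted}}) \) is referred to a chi-squared distribution with 1 degree of freedom (the counts are invented).

```python
import numpy as np
from scipy.stats import chi2

def bernoulli_loglik(successes, trials, p):
    """Log-likelihood of a Bernoulli rate p given success/trial counts."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return successes * np.log(p) + (trials - successes) * np.log(1 - p)

# H0: both slices share one error rate; H1: each slice has its own rate.
k1, n1 = 130, 1000      # errors on slice 1 (illustrative counts)
k2, n2 = 95, 1000       # errors on slice 2

ll_restricted = bernoulli_loglik(k1 + k2, n1 + n2, (k1 + k2) / (n1 + n2))
ll_full = bernoulli_loglik(k1, n1, k1 / n1) + bernoulli_loglik(k2, n2, k2 / n2)

lr_stat = 2.0 * (ll_full - ll_restricted)     # asymptotically chi-squared under H0
p_value = chi2.sf(lr_stat, df=1)              # one extra free parameter in the full model
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```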
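
A percentile-bootstrap sketch for the bootstrap entry: resample test rows with replacement, recompute the metric each time, and take empirical quantiles. Accuracy and the synthetic labels here are placeholders for whatever metric and data you actually have.

```python
import numpy as np

def percentile_bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a per-dataset metric, resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Illustrative usage with accuracy as the metric on synthetic labels.
y_true = np.random.default_rng(1).integers(0, 2, 500)
y_pred = np.where(np.random.default_rng(2).uniform(size=500) < 0.85, y_true, 1 - y_true)
lo, hi = percentile_bootstrap_ci(y_true, y_pred, lambda t, p: np.mean(t == p))
print(f"accuracy 95% CI: [{lo:.3f}, {hi:.3f}]")
```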
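
Finally, the Brier score itself, which is short enough to write out and makes the "confidently wrong costs more" point from its entry concrete.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

# A confidently wrong 0.99 costs far more than a cautious 0.6 on the same miss.
print(brier_score([0.99], [0]))   # 0.9801
print(brier_score([0.60], [0]))   # 0.36
```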