Tag: regularization
11 topics
- L1 vs. L2 Norms: The L1 norm sums absolute values and tends to promote sparsity when used as a penalty, while the L2 norm measures Euclidean length and tends to shrink weights smoothly without zeroing many of them out. That difference is why L1 is associated with feature selection and L2 with stable shrinkage (sketch after this list).
- Dropout: Dropout regularizes a neural network by randomly zeroing activations during training, which prevents units from co-adapting too strongly. At test time the full network is used with rescaled activations, making dropout behave like an inexpensive ensemble-style regularizer (sketch after this list).
- Generalization: Generalization is a model's ability to perform well on unseen data from the same underlying distribution as its training data. It is the real goal of learning, because low training error alone can come from memorization rather than useful structure.
- Regularization: Regularization is any technique that biases learning toward simpler, more stable, or less overfit solutions. It can appear as an explicit penalty such as weight decay or as an implicit training choice such as data augmentation, dropout, or early stopping.
- L1 regularization (Lasso): L1 regularization adds a penalty proportional to the sum of absolute parameter values, encouraging many coefficients to become exactly zero. That sparsity makes Lasso useful when feature selection is part of the goal, not just shrinkage (sketch after this list).
- L2 regularization (Ridge/Weight Decay): L2 regularization adds a penalty proportional to the sum of squared parameter values, shrinking weights smoothly toward zero, though it rarely makes them exactly zero. In plain SGD it is equivalent to weight decay, and it is widely used because it improves stability and reduces variance (sketch after this list).
- Early stopping: Early stopping regularizes training by halting optimization when validation performance stops improving and keeping the best checkpoint seen so far. It works because prolonged optimization can eventually fit noise or idiosyncrasies of the training set rather than signal (sketch after this list).
- Overfitting: Overfitting happens when a model fits patterns specific to the training set, including noise, better than it captures the underlying data-generating structure. The usual symptom is low training error paired with substantially worse validation or test error.
- Batch Normalization: Batch normalization normalizes activations using the mini-batch mean and variance, then applies learned scale and shift parameters. It stabilizes optimization and enables deeper networks, but its behavior differs between training and inference: training normalizes with batch statistics, while inference uses accumulated running statistics (sketch after this list).
- Weight Decay: Weight decay shrinks parameters toward zero by multiplying them by a factor slightly below 1 on each optimizer step. In plain SGD it is equivalent to L2 regularization, but with adaptive optimizers the decoupled AdamW form is usually preferred (sketch after this list).
- Sparse Representations in Deep Nets: A sparse representation is one where only a small fraction of units are active for any given input. Deep nets often develop sparsity through ReLU-like nonlinearities or explicit penalties, which can improve efficiency, feature selectivity, and sometimes interpretability (sketch after this list).
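
For the L1 vs. L2 norms entry, a minimal NumPy sketch of both norms and of the penalty gradients that explain the sparsity difference; the toy weight vector is made up for illustration.

```python
import numpy as np

w = np.array([0.5, -0.01, 2.0, 0.0, -0.3])  # toy weight vector

l1 = np.sum(np.abs(w))        # L1 norm: sum of absolute values -> 2.81
l2 = np.sqrt(np.sum(w ** 2))  # L2 norm: Euclidean length

# As penalties, the gradients explain the behavioral difference:
# the L1 gradient sign(w) pushes every nonzero weight by a constant
# amount, so small weights get driven exactly to zero, while the L2
# gradient 2*w shrinks proportionally and stalls as it nears zero.
grad_l1 = np.sign(w)
grad_l2 = 2 * w
print(l1, l2)
```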
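
For the dropout entry, a short PyTorch sketch. Note that `torch.nn.Dropout` implements the "inverted" variant: survivors are scaled by 1/(1-p) during training, so evaluation mode is a plain identity rather than rescaling at test time as in the classic formulation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # zero each activation with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # about half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))  # identity: the full network is used at test time
```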
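
For the Lasso entry, a scikit-learn sketch on synthetic data where only three of ten features matter; the data and `alpha` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [3.0, -2.0, 1.5]               # only 3 informative features
y = X @ true_coef + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # uninformative coefficients come out exactly 0.0
```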
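
For the Ridge/L2 entry, a PyTorch sketch that adds the squared-norm penalty to the loss explicitly. Since the penalty gradient is 2·lam·w, plain SGD with `weight_decay=2*lam` would take the same steps; the data and `lam` are illustrative.

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 5)
y = X @ torch.tensor([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * torch.randn(100)

w = torch.zeros(5, requires_grad=True)
lam = 0.01                           # illustrative penalty strength
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = ((X @ w - y) ** 2).mean() + lam * (w ** 2).sum()  # MSE + L2 penalty
    loss.backward()
    opt.step()
print(w)  # shrunk toward zero relative to the unpenalized least-squares fit
```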
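
For the early stopping entry, a patience-based sketch; `train_one_epoch` and `validate` are hypothetical stand-ins for your own training and validation steps, and the checkpointing assumes a PyTorch-style `state_dict`.

```python
import copy

def fit(model, epochs=100, patience=5):
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(epochs):
        train_one_epoch(model)          # hypothetical training step
        val_loss = validate(model)      # hypothetical validation step
        if val_loss < best_loss:        # validation improved: reset patience
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` epochs
                break
    model.load_state_dict(best_state)   # restore the best checkpoint seen
    return model
```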
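
For the batch normalization entry, a PyTorch sketch of the train/inference difference: in `train()` mode the layer normalizes with the current mini-batch's statistics (and updates its running estimates), while in `eval()` mode it uses those running estimates instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)            # includes learned scale and shift per feature
x = torch.randn(32, 4) * 5 + 10   # batch far from zero mean, unit variance

out = bn(x)                       # train mode: uses this batch's mean/variance
print(out.mean(0), out.std(0))    # roughly 0 and 1 per feature

bn.eval()
out_eval = bn(x)                  # eval mode: uses the running statistics,
                                  # which here have only seen one batch
```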
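
For the weight decay entry, a PyTorch sketch isolating the decay itself: with a loss whose gradient is zero, AdamW's decoupled decay still multiplies the weights by (1 - lr * weight_decay) on each step. The lr and weight_decay values are exaggerated for visibility.

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.1)  # decoupled decay

for _ in range(100):
    opt.zero_grad()
    loss = 0.0 * (w ** 2).sum()   # zero loss => zero gradient
    loss.backward()
    opt.step()                    # only the decay acts: w <- (1 - lr*wd) * w

print(w)  # roughly 0.99 ** 100 ≈ 0.366 in every entry
```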
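
For the sparse representations entry, a PyTorch sketch that measures activation sparsity after a ReLU layer and shows the kind of L1 activation penalty (with an illustrative coefficient) that would push sparsity higher if added to a training loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
x = torch.randn(128, 20)
h = encoder(x)                                 # hidden representation

inactive = (h == 0).float().mean()
print(f"fraction of inactive units: {inactive:.2f}")  # ~0.5 at random init

# An explicit sparsity penalty on activations (coefficient is illustrative):
sparsity_penalty = 1e-3 * h.abs().mean()       # add to the task loss if desired
```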