
Ridge regression minimizes the sum of squared residuals plus an L2 penalty λ∑β²
Image: NoMore201, CC BY-SA 4.0, via Wikimedia Commons
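The ridge objective above has a closed-form minimizer. A minimal sketch, assuming a single feature and no intercept, so that minimizing ∑(y − βx)² + λβ² gives β = ∑xy / (∑x² + λ); the function name `ridge_1d` and the toy data are illustrative, not from any library:

```python
def ridge_1d(x, y, lam):
    """Minimize sum((y_i - b*x_i)**2) + lam * b**2 over one coefficient b (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative to zero gives b * (sxx + lam) = sxy.
    return sxy / (sxx + lam)

# Toy data generated by y = 2x, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

b_ols = ridge_1d(x, y, 0.0)     # lam = 0 recovers ordinary least squares
b_ridge = ridge_1d(x, y, 14.0)  # a positive lam shrinks the coefficient toward zero
print(b_ols, b_ridge)
```

In this single-feature setting, a larger λ shrinks β further, but for any finite λ it never becomes exactly zero unless ∑xy is already zero.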
LASSO uses an L1 penalty to do feature selection by driving some coefficients to exactly zero
LASSO minimizes the cost function with L1 penalty, driving some coefficients to zero for feature selection
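Why the L1 penalty produces exact zeros can be seen in the simplest case. A rough sketch, assuming one feature, no intercept, and the objective ∑(y − βx)² + λ|β|, whose minimizer is a soft-thresholded least-squares solution; `soft_threshold` and `lasso_1d` are hypothetical helper names:

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam):
    """Minimize sum((y_i - b*x_i)**2) + lam * abs(b) over one coefficient b (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # The stationarity condition gives b = S(sxy, lam / 2) / sxx.
    return soft_threshold(sxy, lam / 2.0) / sxx

# Toy data generated by y = 2x, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 28.0, 100.0):
    print(lam, lasso_1d(x, y, lam))
```

Once λ/2 reaches |∑xy|, the coefficient is set to exactly zero, which is the mechanism behind feature selection.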
L1 vs L2 regularization: L1 gives sparsity (feature selection), L2 gives small weights
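L1 regularization penalizes the sum of absolute weights, which drives some weights to exactly zero (sparsity); L2 regularization penalizes the sum of squared weights, which shrinks all weights toward zero without making them exactly zero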
When to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize when zero mean and unit variance are required for gradient-based optimization
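Standardization itself is a one-line transform, z = (x − mean) / std. A small sketch using only the Python standard library; the function name `standardize` and the sample values are assumptions made for illustration:

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]

# Illustrative feature column with mean 5 and population std 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(feature)
print(z)
```

In practice the mean and standard deviation are usually computed on the training split only and then reused to transform validation and test data.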
Regularization (mathematics)
L1 regularization results in sparse solutions
Batch size affects generalization: larger batches tend to find sharper minima
Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which usually generalize worse; the gradient noise of smaller batches tends to favor flatter minima
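The gradient-accuracy part of this claim can be checked numerically: the variance of a mini-batch mean falls roughly as 1/B. A toy sketch, assuming synthetic scalar "per-example gradients" drawn from a Gaussian; all names and numbers here are illustrative, and it says nothing about sharpness of minima:

```python
import random
import statistics

def batch_mean_variance(per_example_grads, batch_size, trials, rng):
    """Empirical variance of the mini-batch mean over repeated random batches."""
    means = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic per-example gradients: mean 1.0, standard deviation 2.0.
grads = [rng.gauss(1.0, 2.0) for _ in range(10_000)]

var_small = batch_mean_variance(grads, 8, 2000, rng)
var_large = batch_mean_variance(grads, 128, 2000, rng)
print(var_small, var_large)
```

The smaller batch yields a noisier estimate of the same underlying gradient; that noise is what the batch-size discussion above refers to.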
Rate-distortion theory: minimum bits to represent data within distortion D
Rate-distortion theory: the rate-distortion function R(D) is the minimum number of bits per symbol needed to represent data with distortion at most D
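One case where R(D) has a simple closed form is a memoryless Gaussian source under squared-error distortion: R(D) = ½ log₂(σ²/D) for 0 < D < σ², and 0 otherwise. A brief sketch of that special case only; the function name is hypothetical:

```python
import math

def gaussian_rate_distortion(variance, distortion):
    """R(D) in bits per sample for a Gaussian source with squared-error distortion."""
    if variance <= 0 or distortion <= 0:
        raise ValueError("variance and distortion must be positive")
    if distortion >= variance:
        # Sending nothing and reconstructing the mean already meets the target.
        return 0.0
    return 0.5 * math.log2(variance / distortion)

r_quarter = gaussian_rate_distortion(1.0, 0.25)  # tolerate a quarter of the variance
r_loose = gaussian_rate_distortion(1.0, 2.0)     # tolerance exceeds the variance
print(r_quarter, r_loose)
```

Each halving of the allowed distortion costs an extra half bit per sample in this Gaussian case.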