L1 vs. L2 regularization: L1 adds the penalty λ∑|β|, which promotes sparsity; L2 adds the penalty λ∑β², which keeps weights small
LASSO uses L1 to do feature selection by driving coefficients to exactly zero
LASSO minimizes the sum of squared residuals plus the L1 penalty λ∑|β|, which drives some coefficients exactly to zero and so performs feature selection
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
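A minimal sketch of proximal gradient descent applied to the LASSO objective, assuming NumPy and a loss scaled as (1/2n)‖y − Xb‖² + λ‖b‖₁. The function names `soft_threshold` and `lasso_ista`, the fixed step size 1/L, and the synthetic data are illustrative assumptions, not taken from the card. Each iteration takes a gradient step on the smooth squared-error term, then applies the proximity operator of the L1 term, which is soft thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||v||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by proximal gradient descent."""
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / L
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n                      # gradient of the squared-error term
        b = soft_threshold(b - step * grad, step * lam)   # proximal step for the L1 term
    return b

# Synthetic example (assumed data): only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_b = np.zeros(10)
true_b[:3] = [3.0, -2.0, 1.5]
y = X @ true_b + 0.1 * rng.normal(size=200)
b_hat = lasso_ista(X, y, lam=0.1)
```

Soft thresholding sets any entry whose magnitude falls below the threshold to zero, which is how the iterates end up sparse even though the L1 term has no gradient at zero.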
When to normalize features: when features have different scales and you use distance-based methods
Normalize features when they have different scales for distance-based methods
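To illustrate why scale matters for distance-based methods, here is a small sketch assuming NumPy and z-score standardization (one common choice of normalization). The income/age data and the helper names `standardize` and `pairwise_dist` are made up for illustration.

```python
import numpy as np

# Hypothetical data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[ 50_000.0, 25.0],
              [ 51_000.0, 60.0],
              [ 58_000.0, 26.0],
              [100_000.0, 45.0]])

def standardize(X):
    """Rescale each column to zero mean and unit variance (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pairwise_dist(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

raw = pairwise_dist(X)                  # dominated by the income column
scaled = pairwise_dist(standardize(X))  # both columns contribute comparably

# Index of the nearest neighbor of row 0, skipping row 0 itself.
nearest_raw = raw[0].argsort()[1]
nearest_scaled = scaled[0].argsort()[1]
```

On the raw data the nearest neighbor of row 0 is row 1, because a 35-year age gap is tiny next to income differences measured in thousands. After standardization the nearest neighbor becomes row 2, which is close in both income and age.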
Regularization (mathematics)
L1 regularization results in sparse solutions
Batch norm vs layer norm: BN across batch, LN across features
Batch norm (BN) normalizes each feature across the batch; layer norm (LN) normalizes each sample across its features, so it does not depend on batch statistics and handles variable-length sequences
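The difference comes down to which axis the mean and variance are computed over. The sketch below assumes NumPy and a 2-D input of shape (batch, features). It deliberately omits the learnable scale and shift parameters and the running statistics that real batch norm layers keep for inference.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature using statistics computed across the batch (axis 0)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each sample using statistics computed across its features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 8))  # (batch, features)
bn = batch_norm(x)   # each column has roughly zero mean
ln = layer_norm(x)   # each row has roughly zero mean
```

Because `layer_norm` only looks at one sample at a time, normalizing a single row gives the same result as normalizing it inside a larger batch. That is not true of `batch_norm`.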
Ridge regression uses L2 to shrink coefficients without eliminating them
Ridge regression minimizes the sum of squared residuals plus L2 penalty λ∑β²
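For this objective the minimizer has a closed form, β = (XᵀX + λI)⁻¹Xᵀy. The sketch below assumes NumPy, no intercept term, and synthetic data. The function name `ridge_fit` and the chosen λ values are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic example (assumed data): the last two true coefficients are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, lam=50.0)  # larger lam shrinks the coefficients
```

The ridge coefficients have a smaller norm than the least-squares ones, but none of them is exactly zero, in contrast to the L1 penalty.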