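L1 regularization penalizes the sum of absolute weights, which shrinks them and sets some to exactly zero (sparsity); L2 regularization penalizes the sum of squared weights, which shrinks them toward zero without making them exactly zero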

L1 vs L2 regularization: L1 gives sparsity (feature selection), L2 gives small weights

Related concepts

LASSO uses L1 to do feature selection by driving coefficients to exactly zero

LASSO minimizes the sum of squared residuals plus the L1 penalty λ∑|β|, which drives some coefficients to exactly zero and so performs feature selection
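
To see where the exact zeros come from, here is a minimal sketch for a single feature with no intercept, assuming the objective is the sum of squared residuals plus λ|β|. Under that assumption the minimizer has a closed form: soft-threshold the inner product x·y by λ/2, then divide by x·x. The function name `lasso_1d` and the toy data are illustrative and do not come from any library.

```python
def lasso_1d(x, y, lam):
    """Minimize sum((y_i - x_i * b)^2) + lam * |b| over a single coefficient b."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    t = lam / 2.0  # threshold implied by this scaling of the objective
    if xy > t:
        return (xy - t) / xx
    if xy < -t:
        return (xy + t) / xx
    return 0.0  # the penalty outweighs the fit: the coefficient is exactly zero


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2 * x, so the unpenalized coefficient is 2

for lam in (0.0, 60.0, 120.0, 200.0):
    print(lam, lasso_1d(x, y, lam))
```

On this toy data the coefficient shrinks from 2.0 to 1.0 as λ grows, then becomes exactly 0.0 once λ reaches 120 and stays there.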

Proximal gradient methods for learning

Proximal gradient descent efficiently handles the non-differentiable L1 penalty by alternating a gradient step on the smooth loss with a proximity operator, which for the L1 penalty is soft-thresholding
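
The idea can be sketched as ISTA (iterative soft-thresholding) in plain Python. This is a minimal illustration, not a production solver. It assumes the objective is the sum of squared residuals plus λ∑|β|, uses a deliberately conservative fixed step size, and starts from an arbitrary nonzero point. The names `soft_threshold` and `ista` and the toy data are hypothetical.

```python
def soft_threshold(v, t):
    """Proximity operator of t * |.|: move v toward zero by t, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(X, y, lam, n_iter=500):
    """Minimize ||X b - y||^2 + lam * ||b||_1 by proximal gradient descent."""
    n_features = len(X[0])
    # The gradient of the squared loss is 2 X^T (X b - y). Its Lipschitz constant
    # is at most 2 * (sum of squared entries of X), so this step size is safe.
    step = 1.0 / (2.0 * sum(v * v for row in X for v in row))
    b = [1.0] * n_features  # arbitrary nonzero start
    for _ in range(n_iter):
        residual = [sum(v * bj for v, bj in zip(row, b)) - yi for row, yi in zip(X, y)]
        grad = [2.0 * sum(row[j] * r for row, r in zip(X, residual)) for j in range(n_features)]
        # Gradient step on the smooth part, then the proximity operator of the L1 part.
        b = [soft_threshold(bj - step * gj, step * lam) for bj, gj in zip(b, grad)]
    return b


# Toy data: y = 2 * x1 + 0.1 * x2, so the second feature barely matters.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [2.1, 3.9, 5.9, 8.1]
b = ista(X, y, lam=6.0)
print(b)
```

On this data the first coefficient settles near 1.9 and the second is set to exactly 0.0, even though the iteration starts with both coefficients at 1.0.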

When to normalize features: when features have different scales and you use distance-based methods

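Normalize features when they have different scales and the method relies on distances (for example k-nearest neighbors or k-means), so that the feature with the largest scale does not dominate the distance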
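
A small sketch of why scale matters for distances, assuming Euclidean distance and z-score standardization (one common choice of normalization). The helper names `standardize` and `nearest` and the age/income rows are made up for illustration.

```python
import math


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population standard deviation."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest(rows, query):
    """Index of the row closest to rows[query] in Euclidean distance."""
    others = [i for i in range(len(rows)) if i != query]
    return min(others, key=lambda i: math.dist(rows[i], rows[query]))


# Columns: age in years, income in dollars (hypothetical values).
people = [
    [25.0, 50000.0],
    [60.0, 50200.0],
    [27.0, 51000.0],
    [45.0, 58000.0],
]

print(nearest(people, 0))               # raw features: income dominates the distance
print(nearest(standardize(people), 0))  # standardized: both features contribute
```

With raw features the nearest neighbor of the first person is the 60-year-old (index 1), because a 200-dollar income gap outweighs a 35-year age gap. After standardization it is the 27-year-old (index 2).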

Regularization (mathematics)

L1 regularization results in sparse solutions

Batch norm vs layer norm: BN across batch, LN across features

Batch norm (BN) normalizes each feature across the samples in a batch; layer norm (LN) normalizes each sample across its own features. Because LN does not depend on batch statistics, it handles variable-length sequences
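
The difference is only the axis over which the mean and variance are taken. A minimal sketch on a batch stored as a list of rows (one row per sample), leaving out the learnable scale and shift and the running statistics that real implementations add. The function names here are illustrative.

```python
import math


def _standardize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [_standardize(list(col), eps) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]


def layer_norm(x, eps=1e-5):
    """Normalize each sample (row) using statistics over its own features."""
    return [_standardize(row, eps) for row in x]


x = [[1.0, 2.0, 3.0],
     [4.0, 6.0, 8.0]]

print(batch_norm(x))  # each column has mean 0 across the two samples
print(layer_norm(x))  # each row has mean 0 across its three features
```

In this sketch, `layer_norm` gives the same result for a sample whether or not other samples are in the batch, while `batch_norm` changes when the batch changes.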

Ridge regression uses L2 to shrink coefficients without eliminating them

Ridge regression minimizes the sum of squared residuals plus L2 penalty λ∑β²
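
For a single feature with no intercept, setting the derivative of that objective to zero gives β = (x·y) / (x·x + λ). The sketch below uses this one-dimensional special case to show shrinkage without elimination. The function name `ridge_1d` and the data are illustrative.

```python
def ridge_1d(x, y, lam):
    """Minimize sum((y_i - x_i * b)^2) + lam * b^2 over a single coefficient b."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return xy / (xx + lam)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2 * x, so the unpenalized coefficient is 2

for lam in (0.0, 30.0, 270.0, 1e6):
    print(lam, ridge_1d(x, y, lam))
```

The coefficient shrinks steadily as λ grows, but for any finite λ it stays nonzero as long as x·y is nonzero.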
