Normalize features when they have different scales for distance-based methods
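A minimal sketch of why scale matters for distance-based methods, assuming NumPy and a made-up table of ages and incomes (the data and the helper names `min_max_scale` and `nearest_neighbor` are hypothetical). Before scaling, the income column dominates the Euclidean distance; after min-max scaling, the nearest neighbor of row 0 changes.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
    [40.0, 90_000.0],
])

def min_max_scale(X):
    """Rescale each column to the range [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)

def nearest_neighbor(X, i):
    """Index of the row closest to row i by Euclidean distance, excluding i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # a row is not its own neighbor
    return int(np.argmin(d))

raw_nn = nearest_neighbor(X, 0)                    # income dominates the distance
scaled_nn = nearest_neighbor(min_max_scale(X), 0)  # both columns contribute

print(raw_nn, scaled_nn)
```

On the raw data the 35-year age gap to row 1 is invisible next to the income differences; after scaling, row 2 (similar age, modestly higher income) becomes the closest point.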
L1 vs L2 regularization: L1 gives sparsity (feature selection), L2 gives small weights
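L1 regularization penalizes the sum of absolute weights, which drives some weights to exactly zero (sparsity); L2 regularization penalizes the sum of squared weights, which shrinks all weights toward zero without eliminating any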
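One way to see the difference is through the proximal (shrinkage) step each penalty induces. The sketch below assumes NumPy; the function names `l1_prox` and `l2_prox`, the weight vector, and the strength `lam` are illustrative choices, not part of any particular library.

```python
import numpy as np

def l1_prox(w, lam):
    """Soft-thresholding: proximal step for the penalty lam * sum(|w|)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_prox(w, lam):
    """Proximal step for the penalty (lam / 2) * sum(w**2): uniform shrinkage."""
    return w / (1.0 + lam)

w = np.array([3.0, -0.2, 0.05, -1.5])  # hypothetical weights
lam = 0.5                              # hypothetical penalty strength

w_l1 = l1_prox(w, lam)  # entries with |w| <= lam become exactly zero
w_l2 = l2_prox(w, lam)  # every entry shrinks, none becomes zero

print(w_l1)
print(w_l2)
```

The L1 step zeroes the two small weights and keeps the two large ones, which is the feature-selection effect; the L2 step scales every weight by the same factor.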
Cosine similarity vs dot product: they agree on unit-normalized embeddings
Cosine similarity measures orientation, not magnitude; on unit-length embeddings it equals the dot product, so it matters most when embedding norms vary and should not affect the score
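A short sketch of the magnitude-invariance property, assuming NumPy and three made-up vectors (`cosine_similarity` is a hypothetical helper). Scaling a vector changes its dot product but not its cosine similarity, and once vectors are unit length the two measures coincide.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                   # same direction, ten times the magnitude
c = np.array([3.0, 2.0, 1.0])  # a different direction

cos_ab = cosine_similarity(a, b)  # unaffected by the scaling of b
dot_ab = float(np.dot(a, b))      # grows with the scaling of b

# After normalizing to unit length, dot product and cosine similarity agree.
a_unit = a / np.linalg.norm(a)
c_unit = c / np.linalg.norm(c)
dot_unit = float(np.dot(a_unit, c_unit))
cos_ac = cosine_similarity(a, c)

print(cos_ab, dot_ab, dot_unit, cos_ac)
```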
When to standardize: when features should have zero mean and unit variance for gradient-based optimization
Standardize features to zero mean and unit variance when training with gradient-based optimization; comparable scales improve conditioning and usually speed up convergence
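A minimal standardization sketch, assuming NumPy and synthetic data; the `standardize` helper and the `eps` guard against zero-variance columns are illustrative choices. In practice the mean and standard deviation would be computed on the training split and reused for any later data.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Return z-scored columns plus the statistics used to compute them."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(loc=100.0, scale=25.0, size=200),
    rng.normal(loc=0.001, scale=0.0005, size=200),
])

Z, mean, std = standardize(X)

print(Z.mean(axis=0), Z.std(axis=0))
```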
Batch norm vs layer norm: BN across batch, LN across features
Batch norm (BN) normalizes each feature across the batch; layer norm (LN) normalizes across the features of each sample, so LN does not depend on batch size and handles variable-length sequences
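The difference comes down to which axis the statistics are taken over. The sketch below assumes NumPy and a random `(batch, features)` array, and it omits the learnable scale and shift and the running statistics that real BN layers keep.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 6))  # shape: (batch, features)
eps = 1e-5  # small constant for numerical stability

# Batch norm: statistics per feature, computed across the batch (axis 0).
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

# Layer norm: statistics per sample, computed across the features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# Layer norm of one sample alone matches its row in the batched result,
# because no other sample contributes to its statistics.
x0 = x[:1]
ln_single = (x0 - x0.mean(axis=1, keepdims=True)) / np.sqrt(x0.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0), ln.mean(axis=1))
```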
Batch size affects generalization: larger batches tend to find sharper minima
Larger batches give more accurate gradient estimates but tend to converge to sharper minima, which often generalize worse than the flatter minima reached with smaller, noisier batches
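The gradient-estimate half of this claim is easy to check numerically; the link to sharpness and generalization is an empirical observation that this sketch does not demonstrate. It assumes NumPy and a synthetic one-parameter regression problem, with `minibatch_grad` and `grad_std` as hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)  # synthetic data, true slope 2
w = 0.0                                      # parameter value at which we probe

def minibatch_grad(batch_size):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over a random mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return float(np.mean((w * x[idx] - y[idx]) * x[idx]))

def grad_std(batch_size, trials=500):
    """Spread of the mini-batch gradient estimate across repeated draws."""
    return float(np.std([minibatch_grad(batch_size) for _ in range(trials)]))

std_small = grad_std(8)
std_large = grad_std(512)

print(std_small, std_large)
```

The estimate from 512 samples varies far less between draws than the estimate from 8 samples; that reduced noise is what small-batch training gives up in exchange for its tendency to settle in flatter regions.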
Cosine similarity measures the angle between vectors, not their magnitude