Batch norm (BN) normalizes across batch, layer norm (LN) normalizes across features; LN handles variable-length sequences

Image: CHRISTOPHER MACSURAK, CC BY 2.0, via Wikimedia Commons

Batch norm vs layer norm: BN across batch, LN across features

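Batch norm (BN) computes the mean and variance of each feature over the examples in a batch; layer norm (LN) computes them over the features of each individual example. Because LN's statistics do not depend on the other examples in the batch, it handles variable-length sequences and small batches.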
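The difference comes down to which axis the statistics are taken over. A minimal NumPy sketch, assuming a 2-D input of shape (batch, features) and leaving out the learnable scale and shift as well as BN's running statistics; the function names are illustrative, not a library API:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics per feature, computed over the batch axis (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Statistics per example, computed over the feature axis (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # (batch, features)

bn = batch_norm(x)  # each column has roughly zero mean
ln = layer_norm(x)  # each row has roughly zero mean

# LN on a single example gives the same result as LN on the full batch;
# BN has no such property because its statistics depend on the batch.
ln_single = layer_norm(x[:1])
```

Here `ln_single` matches `ln[:1]`, which is why LN behaves the same at batch size 1 or with sequences of different lengths.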

Related concepts

transformers use LayerNorm not BatchNorm

LayerNorm normalizes each token across its features, so it accommodates variable-length sequences and any batch size, unlike BatchNorm, which relies on statistics computed across the batch

L1 vs L2 regularization: L1 gives sparsity (feature selection), L2 gives small weights

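L1 regularization penalizes the sum of absolute weights and drives some weights to exactly zero (sparsity); L2 regularization penalizes the sum of squared weights and shrinks all weights toward zero without zeroing them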
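One way to see the difference is to compare the shrinkage step each penalty induces. The sketch below uses the proximal operators of the two penalties (soft-thresholding for lam * ||w||_1, uniform scaling for (lam / 2) * ||w||_2^2); the weight vector and `lam` are made-up values for illustration:

```python
import numpy as np

def l1_prox(w, lam):
    # Soft-thresholding: weights with |w| <= lam become exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_prox(w, lam):
    # Proximal step for (lam / 2) * ||w||^2: every weight is scaled down,
    # but none is set to zero.
    return w / (1.0 + lam)

w = np.array([2.0, -0.3, 0.05, -1.5, 0.2])  # hypothetical weights
lam = 0.25

w_l1 = l1_prox(w, lam)  # small weights are zeroed out
w_l2 = l2_prox(w, lam)  # all weights shrink, none vanish
```

With these values, `w_l1` zeroes the two smallest weights while `w_l2` keeps all five nonzero, which is the sparsity versus small-weights contrast.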

all-MiniLM-L6-v2: 384-dim embeddings for fast sentence similarity with 6 layers

all-MiniLM-L6-v2 is a 6-layer sentence-embedding model that maps text to 384-dimensional vectors and is optimized for fast sentence similarity

batch size affects generalization: larger batches find sharper minima

Larger batch sizes give more accurate gradient estimates, but the reduced gradient noise tends to lead to sharper minima, which often generalize worse than the flatter minima found with smaller batches
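The gradient-noise part of this claim is easy to check numerically. The sketch below is a toy setup and an assumption of this example (1-D linear regression, synthetic data, a fixed weight); it only shows that mini-batch gradient noise shrinks as the batch grows and does not demonstrate the sharp-minima effect itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: y = 2x + noise.
n = 4096
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Per-example gradient of the squared error (w*x - y)^2 at a fixed w.
w = 0.0
per_example_grad = 2.0 * (w * x - y) * x

def minibatch_grad_std(batch_size, trials=2000):
    # Sample many mini-batches (with replacement) and measure how much
    # the averaged gradient varies from batch to batch.
    idx = rng.integers(0, n, size=(trials, batch_size))
    return per_example_grad[idx].mean(axis=1).std()

noise_small = minibatch_grad_std(8)
noise_large = minibatch_grad_std(128)
```

The noise falls roughly with the square root of the batch size, so `noise_large` should be about a quarter of `noise_small` here.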

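When to normalize features: when features have different scales and you use distance-based methods

Normalize features when they are on different scales and the method is distance-based (for example k-nearest neighbors or k-means); otherwise the feature with the largest numeric range dominates the distance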
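A small illustration of why scale matters for distances, using made-up (age, income) rows and z-score standardization as one common choice of normalization; the helper names are hypothetical:

```python
import numpy as np

# Hypothetical data: columns are age (years) and income (dollars).
X = np.array([
    [30.0, 40000.0],
    [31.0, 45000.0],
    [60.0, 41000.0],
    [58.0, 90000.0],
])

def standardize(X):
    # Z-score each column: zero mean, unit standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest(X, i):
    # Index of the row closest to row i in Euclidean distance.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the point itself
    return int(np.argmin(d))

raw_nn = nearest(X, 0)                  # income dominates the distance
scaled_nn = nearest(standardize(X), 0)  # both features contribute
```

On the raw data the 60-year-old with a similar income is the nearest neighbor of row 0; after standardization the 31-year-old is, because the 30-year age gap is no longer drowned out by the income column.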

Pre-LN

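Pre-LN: LayerNorm is applied before each sublayer (attention or feed-forward), inside the residual branch; Post-LN: LayerNorm is applied after the sublayer output is added back to the residual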
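The two placements differ only in where the normalization sits relative to the residual connection. A minimal NumPy sketch, where `sublayer` is a hypothetical stand-in (a single linear map) for attention or a feed-forward layer, and LayerNorm omits the learnable scale and shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Placeholder for attention or feed-forward (assumption of this sketch).
    return h @ W

x = rng.normal(size=(5, 8))  # (tokens, features)
post = post_ln_block(x, sublayer)
pre = pre_ln_block(x, sublayer)
```

In the Pre-LN block the input `x` passes through the residual path unchanged, whereas the Post-LN output is always renormalized.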
