Batch norm (BN) normalizes across batch, layer norm (LN) normalizes across features; LN handles variable-length sequences
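Transformers use LayerNorm, not BatchNorm
LayerNorm normalizes each token over its own features, so it does not depend on batch size or sequence length; BatchNorm computes its statistics across the batch, which become unreliable with small batches and padded, variable-length sequences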
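A minimal sketch of the difference in normalization axis, using plain Python lists rather than a deep learning framework. The helper names (`normalize`, `layer_norm`, `batch_norm`) and the toy batch are illustrative assumptions, and the learned scale and shift parameters are omitted.

```python
from statistics import fmean, pvariance

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to unit variance (no learned scale/shift)."""
    mean = fmean(values)
    std = (pvariance(values) + eps) ** 0.5
    return [(v - mean) / std for v in values]

def layer_norm(batch):
    # Each example (row) is normalized over its own features.
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # Each feature (column) is normalized over the examples in the batch.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Hypothetical batch: 2 examples, 3 features each.
batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]

ln_alone = layer_norm([batch[0]])[0]     # first example processed by itself
ln_in_batch = layer_norm(batch)[0]       # same example inside a larger batch
bn_in_batch = batch_norm(batch)[0]       # output depends on the other example
bn_alone = batch_norm([batch[0]])[0]     # batch of one: statistics collapse
```

In this sketch the LayerNorm output for an example is the same whether or not other examples are present, while the BatchNorm output changes with the batch contents and degenerates to zeros for a batch of one.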
L1 vs L2 regularization: L1 gives sparsity (feature selection), L2 gives small weights
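L1 regularization penalizes the sum of absolute weights, which drives some weights to exactly zero (sparsity); L2 regularization penalizes the sum of squared weights, which shrinks all weights toward zero without zeroing them out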
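One way to see the sparsity effect is to apply each penalty to a single weight in closed form. The sketch below assumes a simplified one-dimensional problem (a quadratic fit term plus the penalty); the function names, the weights, and the strength `lam` are illustrative, not taken from any library.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: minimizer of 0.5*(x - w)**2 + lam*abs(x)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # weights smaller than lam in magnitude are set exactly to zero

def l2_shrink(w, lam):
    """Minimizer of 0.5*(x - w)**2 + 0.5*lam*x**2: a uniform rescaling."""
    return w / (1.0 + lam)

# Hypothetical unregularized weights and penalty strength.
weights = [3.0, -0.4, 0.1, -2.0, 0.05]
lam = 0.5

after_l1 = [l1_shrink(w, lam) for w in weights]  # small weights become 0.0
after_l2 = [l2_shrink(w, lam) for w in weights]  # every weight shrinks, none is 0
```

Under these assumptions the L1 step zeroes the three small weights and keeps the two large ones (reduced by `lam`), while the L2 step scales every weight by the same factor and leaves all of them non-zero.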
What the 384-dim all-MiniLM-L6-v2 optimizes: fast sentence similarity with 6 layers
all-MiniLM-L6-v2 is a 6-layer sentence-embedding model that outputs 384-dimensional vectors and is optimized for fast sentence similarity
Batch size affects generalization: larger batches tend to find sharper minima
Larger batches give lower-variance gradient estimates, but the reduced noise tends to steer training toward sharper minima, which often generalize worse than the flatter minima found with smaller batches
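The gradient-noise half of this claim can be illustrated with a small simulation. The sketch below assumes synthetic per-example gradients (a true gradient of 1.0 plus unit Gaussian noise) and a hypothetical helper `minibatch_grad_std`; it shows only how the noise of the mini-batch estimate shrinks with batch size, not the effect on the sharpness of the minima reached.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic per-example gradients: true gradient 1.0 plus unit Gaussian noise.
per_example_grads = [1.0 + random.gauss(0.0, 1.0) for _ in range(20_000)]

def minibatch_grad_std(grads, batch_size, trials=2_000):
    """Standard deviation of the mini-batch mean gradient over random batches."""
    estimates = [fmean(random.sample(grads, batch_size)) for _ in range(trials)]
    return pstdev(estimates)

std_small = minibatch_grad_std(per_example_grads, batch_size=8)
std_large = minibatch_grad_std(per_example_grads, batch_size=128)

# Theory predicts noise proportional to 1/sqrt(batch_size),
# so a 16x larger batch should be roughly 4x less noisy.
noise_ratio = std_small / std_large
```

The smaller batch produces the noisier estimate, and that extra noise is the mechanism usually credited with helping small-batch training settle in flatter regions.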
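When to normalize features: when features have different scales and you use distance-based methods
Normalize features when they are on different scales and the method is distance-based (for example k-nearest neighbors or k-means); otherwise the feature with the largest numeric range dominates the distance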
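A small sketch of how scale can change a nearest-neighbor result. The rows (age in years, income in dollars) are made-up values, and `standardize` and `nearest_to_first` are illustrative helpers, not library functions.

```python
from math import dist
from statistics import fmean, pstdev

# Hypothetical rows: (age in years, income in dollars).
rows = [(25.0, 50_000.0), (60.0, 50_200.0), (26.0, 52_000.0), (45.0, 90_000.0)]

def standardize(rows):
    """Rescale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_to_first(points):
    """Index of the point closest to points[0] by Euclidean distance."""
    return min(range(1, len(points)), key=lambda i: dist(points[0], points[i]))

raw_neighbor = nearest_to_first(rows)                  # income dominates the distance
scaled_neighbor = nearest_to_first(standardize(rows))  # both features contribute
```

On the raw values the 25-year-old's nearest neighbor is the 60-year-old with an almost identical income (index 1); after standardization it is the 26-year-old with a slightly higher income (index 2).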
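Pre-LN vs Post-LN
Pre-LN applies LayerNorm to the input of each sublayer (attention or feed-forward), inside the residual branch; Post-LN applies LayerNorm after the sublayer output has been added back to the residual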
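The two orderings differ only in where the normalization sits relative to the residual addition. The sketch below uses plain Python lists; `sublayer` is a hypothetical stand-in for attention or the feed-forward network, and learned LayerNorm parameters are omitted.

```python
from statistics import fmean, pvariance

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance."""
    mean = fmean(x)
    std = (pvariance(x) + eps) ** 0.5
    return [(v - mean) / std for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v + 1.0 for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_ln_block(x):
    # Post-LN: normalize after the residual addition.
    return layer_norm(add(x, sublayer(x)))

def pre_ln_block(x):
    # Pre-LN: normalize only the sublayer input; the skip path is left untouched.
    return add(x, sublayer(layer_norm(x)))

x = [1.0, 2.0, 4.0]
post_out = post_ln_block(x)  # block output is itself normalized
pre_out = pre_ln_block(x)    # block output still carries the raw input x
```

In the Pre-LN form the input passes through the skip connection unchanged, whereas in the Post-LN form every block output is renormalized.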