Transformers use LayerNorm, not BatchNorm

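LayerNorm normalizes each token's activations across the feature dimension, so its statistics do not depend on batch size or sequence length. BatchNorm computes its statistics across the batch, which becomes unreliable with small batches and with variable-length, padded sequences.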
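The difference can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `layer_norm` and `batch_norm` and the sample vectors are made up for this example, and the learned scale and shift parameters that real layers carry are left out.

```python
import math


def layer_norm(features, eps=1e-5):
    """Normalize one token's feature vector using its own mean and variance."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [(x - mean) / math.sqrt(var + eps) for x in features]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature using the mean and variance taken across the batch."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


token = [1.0, 2.0, 3.0, 4.0]
padding = [0.0, 0.0, 0.0, 0.0]      # e.g. a padded position in a shorter sequence
neighbor = [10.0, -5.0, 7.0, 2.0]   # an unrelated token from another sequence

# LayerNorm looks only at `token`, so its output never depends on the batch.
ln_out = layer_norm(token)

# BatchNorm mixes in whatever else is in the batch, so the same token
# is normalized differently depending on its companions.
bn_with_padding = batch_norm([token, padding])[0]
bn_with_neighbor = batch_norm([token, neighbor])[0]
```

Here `ln_out` is fixed for a given token, while `bn_with_padding` and `bn_with_neighbor` differ even though the token itself is unchanged. That batch dependence is the property that makes BatchNorm awkward for padded, variable-length sequences.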

Image: Fgpacini, CC BY-SA 4.0, via Wikimedia Commons
