Pre-LN: LayerNorm before attention; Post-LN: LayerNorm after attention
Image: Mike Cai Chen, CC0, via Wikimedia Commons
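Pre-LN applies LayerNorm to the input of each sublayer (attention, feed-forward), inside the residual branch; Post-LN applies LayerNorm after the sublayer output has been added back to the residual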
Pre-LN transformers are easier to train
Both variants use residual connections, but Pre-LN leaves the residual path free of normalization, so gradients flow more smoothly during backpropagation and training is usually more stable
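The difference in where LayerNorm sits can be sketched in a few lines of plain Python. This is a minimal illustration, not a real transformer layer: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the learnable scale and shift of LayerNorm are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one feature vector to zero mean and (roughly) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v + 1.0 for v in x]

def pre_ln_block(x, sublayer=toy_sublayer):
    # Pre-LN: x + Sublayer(LayerNorm(x)). The residual path carries x unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer=toy_sublayer):
    # Post-LN: LayerNorm(x + Sublayer(x)). The residual sum itself is normalized.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 8.0]
pre_out = pre_ln_block(x)
post_out = post_ln_block(x)
```

In the Pre-LN form the input `x` reaches the output through a plain addition, which is the identity path that gradients can follow; in the Post-LN form every block's output passes through a normalization.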
Batch norm vs layer norm: BN across batch, LN across features
Batch norm (BN) normalizes each feature across the examples in a batch; layer norm (LN) normalizes each example across its own features, so LN handles variable-length sequences
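A small sketch of the two normalization axes, assuming a batch stored as a list of rows (one row per example, one column per feature). The helper names are illustrative, and the learnable scale and shift that both layers normally carry are left out.

```python
import math

def normalize(values, eps=1e-5):
    # Zero-mean, unit-variance normalization of a flat list of numbers.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # One mean/variance per example, computed over that example's features (rows).
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # One mean/variance per feature, computed over the batch (columns).
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
```

Because `layer_norm` only looks at one row at a time, its result for an example does not change when other examples are added to or removed from the batch; `batch_norm` output depends on everything in the batch.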
Transformers use LayerNorm, not BatchNorm
LayerNorm computes its statistics over the features of each token independently, so it works with variable-length sequences and any batch size, unlike BatchNorm, which relies on statistics gathered across the batch
Masking (behavior)
Causal masking prevents attention to future tokens in the decoder
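One common way to apply a causal mask is to set the scores of future positions to negative infinity before the softmax, so their attention weights come out as exactly zero. The sketch below assumes a square matrix of raw attention scores; the function names are illustrative.

```python
import math

def causal_mask(n):
    # mask[i][j] is True when query position i may attend to key position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    weights = []
    for row, allowed in zip(scores, mask):
        # Future positions get -inf, which becomes weight 0 after exp().
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # finite, because position i can always see itself
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores make the effect of the mask easy to read.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first token attends only to itself, the second splits its attention over the first two tokens, and so on; no row places any weight on a later position.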
When to use an RNN/LSTM: for sequential data where order matters (mostly replaced by transformers)
Use RNN/LSTM for sequential data where order matters (mostly replaced by transformers)
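To make "order matters" concrete, here is a toy recurrent update with a single scalar hidden state. The weights `w_h` and `w_x` are arbitrary values chosen for illustration, not trained parameters, and a real RNN or LSTM would use vectors, matrices, and (for an LSTM) gates.

```python
import math

def rnn_final_state(sequence, w_h=0.5, w_x=1.0):
    # Elman-style recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t), starting from h_0 = 0.
    h = 0.0
    for x in sequence:
        h = math.tanh(w_h * h + w_x * x)
    return h

forward = rnn_final_state([1.0, 2.0, 3.0])
backward = rnn_final_state([3.0, 2.0, 1.0])
```

The two sequences contain the same values, yet the final hidden states differ, because each step folds the current input into a state that already depends on everything seen before it.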
What 384-dim all-MiniLM-L6-v2 optimizes: fast sentence similarity with 6 layers
all-MiniLM-L6-v2 is optimized for fast sentence similarity: a 6-layer model that produces 384-dimensional sentence embeddings