Pre-LN

Pre-LN: LayerNorm before each sublayer (attention, feed-forward), inside the residual branch; Post-LN: LayerNorm after the sublayer output is added to the residual
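A minimal sketch of the two orderings, assuming a LayerNorm without learned scale/shift and a hypothetical toy_sublayer standing in for attention or the feed-forward network (neither name comes from the card):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean, unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v + 1.0 for v in x]

def post_ln_block(x, sublayer):
    # Post-LN: LayerNorm(x + Sublayer(x)); the residual sum itself gets normalized.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    # Pre-LN: x + Sublayer(LayerNorm(x)); the skip path carries x through unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the Pre-LN output the original x is still present as an additive term; in the Post-LN output it has been rescaled by the final normalization.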

Related concepts

Pre-LN transformers are easier to train

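In Pre-LN transformers the residual (skip) path bypasses LayerNorm, so gradients flow more directly during backpropagation; training is typically more stable and less dependent on learning-rate warmup than with Post-LN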
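One way to see the gradient argument, as a sketch in assumed notation (x_l is the input to layer l, F_l is that layer's sublayer, L is the last layer; these symbols are not from the card):

```latex
% Pre-LN: the skip path is an identity, so unrolling the recursion gives a sum.
x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big)
\quad\Longrightarrow\quad
x_L = x_l + \sum_{k=l}^{L-1} F_k\big(\mathrm{LN}(x_k)\big)

% The Jacobian therefore always contains an identity term.
\frac{\partial x_L}{\partial x_l}
  = I + \sum_{k=l}^{L-1} \frac{\partial F_k\big(\mathrm{LN}(x_k)\big)}{\partial x_l}

% Post-LN: every path from x_l to x_L passes through each LayerNorm Jacobian.
x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big)
\quad\Longrightarrow\quad
\frac{\partial x_L}{\partial x_l}
  = \prod_{k=l}^{L-1}
    \left.\frac{\partial\,\mathrm{LN}(u)}{\partial u}\right|_{u = x_k + F_k(x_k)}
    \left(I + \frac{\partial F_k(x_k)}{\partial x_k}\right)
```

The identity term in the Pre-LN Jacobian is the direct gradient path; the Post-LN product has no such term, which is a common explanation for its greater sensitivity to warmup.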

Batch norm vs layer norm: BN across batch, LN across features

Batch norm (BN) normalizes each feature across the batch; layer norm (LN) normalizes each sample across its features, so LN handles variable-length sequences and does not depend on batch size
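The difference in normalization axis can be sketched on a small batch where rows are samples and columns are features. This is an illustrative, stdlib-only version without learned parameters or running statistics; the helper names are assumptions:

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # LN: each row (one sample) is normalized with its own statistics.
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # BN: each column (one feature) is normalized with statistics taken over the batch.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
]
ln_out = layer_norm(batch)  # rows have zero mean
bn_out = batch_norm(batch)  # columns have zero mean
```

Because layer_norm looks at one row at a time, its result for a sample is the same whatever else is in the batch; batch_norm changes whenever the batch contents change.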

Transformers use LayerNorm, not BatchNorm

LayerNorm normalizes each token across its features, so it accommodates variable-length sequences and small batches, unlike BatchNorm, which relies on statistics computed across the batch

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder
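A minimal sketch of how a causal mask is usually applied, assuming raw attention scores arrive as a square list of lists (query position by key position); the function names are illustrative:

```python
import math

def causal_mask(n):
    # mask[i][j] is True where query position i may attend to key position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    weights = []
    for row, allowed in zip(scores, mask):
        # Disallowed (future) positions are set to -inf, so exp() maps them to exactly 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        top = max(masked)  # finite, because position i can always attend to itself
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the effect of the mask easy to read off.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first position puts all of its weight on itself and the last position spreads weight evenly over all four positions; no position gives any weight to a later one.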

When to use an RNN/LSTM: for sequential data where order matters (mostly replaced by transformers)

Use RNN/LSTM for sequential data where order matters (mostly replaced by transformers)

What 384-dim all-MiniLM-L6-v2 optimizes for: fast sentence similarity with 6 layers

all-MiniLM-L6-v2 is optimized for fast sentence similarity: a 6-layer model that produces 384-dimensional embeddings
