
Image: N509FZ, CC BY-SA 4.0, via Wikimedia Commons

Pre-LN transformers are easier to train

Pre-LN transformers apply layer normalization inside each residual branch, so the skip path stays an identity and gradients flow more smoothly during backpropagation
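
A minimal sketch of the two orderings, assuming a toy vector input and a made-up `sublayer` function standing in for attention or the feed-forward network (neither comes from the card itself). It only illustrates where the normalization sits relative to the skip path:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v + 0.1 for v in x]

def pre_ln_block(x):
    # Pre-LN: x + f(LN(x)). The skip path carries x through untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x):
    # Post-LN: LN(x + f(x)). The skip path itself passes through LayerNorm.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0]
print("pre-LN :", pre_ln_block(x))
print("post-LN:", post_ln_block(x))
```

In the Pre-LN version the input is added back unchanged, which is the identity path the gradient can follow.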

Related concepts

Vanishing gradient problem

Residual connections help by giving gradients a direct path through the skip connection, so they are not scaled down again by every layer
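
A toy illustration, assuming a scalar network in which every layer has the same small local derivative (the value 0.05 and the depth 20 are arbitrary choices for the example). For a residual layer y = x + f(x), the derivative is 1 + f'(x), so the identity term survives the product:

```python
def gradient_through_stack(depth, local_derivative, residual):
    """Product of per-layer derivatives for a scalar toy network."""
    grad = 1.0
    for _ in range(depth):
        # Plain layer: dy/dx = f'(x). Residual layer y = x + f(x): dy/dx = 1 + f'(x).
        grad *= (1.0 + local_derivative) if residual else local_derivative
    return grad

plain = gradient_through_stack(depth=20, local_derivative=0.05, residual=False)
skip = gradient_through_stack(depth=20, local_derivative=0.05, residual=True)
print(f"plain: {plain:.3e}, residual: {skip:.3e}")
```

The plain stack's gradient collapses toward zero, while the residual stack's gradient stays at or above 1 in this simplified setting.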

Transformer (deep learning)

Transformers use multi-head attention for contextualizing tokens

The over-smoothing problem in GNNs: deep GNNs make all node features converge

Over-smoothing in GNNs: as layers are stacked, repeated neighborhood aggregation pulls node features toward the same values, and nodes lose their unique identities
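
The effect can be sketched with plain mean aggregation on a tiny graph. This assumes a hypothetical 4-node path graph with scalar features and no learned weights, which is a simplification of a real GNN layer:

```python
def propagate(features, neighbors):
    """One round of mean aggregation over each node and its neighbors (self-loop included)."""
    new = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        new.append(sum(features[j] for j in group) / len(group))
    return new

def spread(values):
    """Gap between the largest and smallest node feature."""
    return max(values) - min(values)

# Hypothetical path graph 0 - 1 - 2 - 3 with scalar node features.
neighbors = [[1], [0, 2], [1, 3], [2]]
features = [1.0, 0.0, 0.0, -1.0]

for layer in range(1, 31):
    features = propagate(features, neighbors)
    if layer in (1, 5, 30):
        print(f"after {layer:2d} layers, spread = {spread(features):.6f}")
```

The spread shrinks with every round, so after enough layers the nodes are nearly indistinguishable.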

The reverse process learns p_θ(x_{t-1} | x_t)

In a diffusion model, the reverse process learns p_θ(x_{t-1} | x_t), denoising one step at a time
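
One way to picture a single reverse step is the noise-prediction parameterization used in DDPM-style samplers. That parameterization is an assumption here, since the card does not specify one. The names `reverse_step` and `predict_noise` and the noise schedule `betas` are hypothetical, and the predictor below is an untrained placeholder:

```python
import math
import random

def reverse_step(x_t, t, predict_noise, betas, rng):
    """Sample x_{t-1} from a Gaussian p_theta(x_{t-1} | x_t) for a scalar x (assumed DDPM form)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = math.prod(1.0 - b for b in betas[: t + 1])
    eps_hat = predict_noise(x_t, t)  # the model's estimate of the noise in x_t
    mean = (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_t)
    noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise is added on the final step
    return mean + math.sqrt(beta_t) * noise

betas = [0.01 * (i + 1) for i in range(10)]  # hypothetical noise schedule
rng = random.Random(0)
x = rng.gauss(0.0, 1.0)  # start from pure noise
for t in reversed(range(len(betas))):
    # Placeholder predictor that always returns 0; a trained network would go here.
    x = reverse_step(x, t, lambda x_t, step: 0.0, betas, rng)
print(x)
```

Each call removes a little of the predicted noise and moves the sample one step closer to the data distribution.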

Adam has bias correction: it divides by (1 - β^t), which matters most in early steps

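Adam's bias correction divides each moment estimate by (1 - β^t) to counteract the bias toward zero caused by initializing the moving averages at zero; the effect is largest in early steps, when β^t is still close to 1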
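
A small sketch of the moment updates with and without the correction, assuming a scalar parameter, a constant gradient of 2.0, and the commonly used defaults β1 = 0.9 and β2 = 0.999. The function name `adam_moments` is made up for this example:

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (m, m_hat, v, v_hat) per step for a stream of scalar gradients."""
    m, v = 0.0, 0.0  # both moving averages start at zero, which is the source of the bias
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        history.append((m, m_hat, v, v_hat))
    return history

for step, (m, m_hat, v, v_hat) in enumerate(adam_moments([2.0] * 3), start=1):
    print(f"t={step}: m={m:.4f} m_hat={m_hat:.4f} v={v:.6f} v_hat={v_hat:.4f}")
```

With a constant gradient, the raw averages start far below the true values, while the corrected estimates recover the gradient and its square from the first step.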

Proximal gradient methods for learning

Proximal gradient descent handles non-differentiable L1 regularization by alternating a gradient step on the smooth loss with a proximity operator, which for the L1 norm is soft-thresholding
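
A minimal sketch on a deliberately simple objective, 0.5 * sum((x_i - b_i)^2) + lam * sum(|x_i|), chosen here because its exact minimizer is known in closed form. The target vector, step size, and iteration count are arbitrary assumptions for the example:

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def proximal_gradient(b, lam, step=0.5, iters=100):
    """Minimize 0.5 * sum((x_i - b_i)^2) + lam * sum(|x_i|)."""
    x = [0.0] * len(b)
    for _ in range(iters):
        # Gradient step on the smooth quadratic part only.
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Proximal step handles the non-differentiable L1 term.
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, grad)]
    return x

b = [3.0, 0.4, -2.0]
x = proximal_gradient(b, lam=1.0)
exact = [soft_threshold(bi, 1.0) for bi in b]  # closed-form minimizer for this objective
print(x, exact)
```

The small middle coordinate is driven exactly to zero, which is the sparsity-inducing behavior that makes L1 regularization attractive.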
