Pre-LN transformers apply layer normalization inside each residual branch, leaving the skip path as an unmodified identity so that gradients flow more smoothly during backpropagation
Image: N509FZ, CC BY-SA 4.0, via Wikimedia Commons
Vanishing gradient problem
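Residual connections mitigate vanishing gradients: the skip connection adds a direct identity term to each layer's gradient, so the signal does not have to pass through every layer's nonlinearity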
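The effect can be illustrated with a toy scalar network. This is a minimal sketch, not a real transformer: the layer function `weight * tanh(x)`, the weight value 0.5, and the name `gradient_through_stack` are all assumptions chosen for illustration. It multiplies the local derivatives of a stack of layers, once as plain layers y = f(x) and once as residual layers y = x + f(x).

```python
import math


def gradient_through_stack(depth, weight=0.5, x=1.0, residual=True):
    """Return d(output)/d(input) for a stack of toy scalar layers.

    Each layer computes f(x) = weight * tanh(x) (an illustrative choice).
    With residual=True the layer output is x + f(x), otherwise just f(x).
    """
    grad = 1.0
    for _ in range(depth):
        fx = weight * math.tanh(x)
        local = weight * (1.0 - math.tanh(x) ** 2)  # f'(x)
        if residual:
            grad *= 1.0 + local  # the identity path contributes the "1"
            x = x + fx
        else:
            grad *= local  # every factor is at most `weight`
            x = fx
    return grad


plain = gradient_through_stack(30, residual=False)
skip = gradient_through_stack(30, residual=True)
print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {skip:.3e}")
```

In this toy setting every plain-layer factor is at most 0.5, so the product shrinks geometrically with depth, while every residual factor is at least 1 because of the identity term.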
Transformer (deep learning)
Transformers use multi-head attention for contextualizing tokens
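A rough sketch of how multi-head self-attention mixes information across tokens is given here. It makes a strong simplifying assumption: the learned query, key, value, and output projections are omitted (treated as identity), and each head simply attends over its own slice of the embedding. The function names are hypothetical and the code is plain Python rather than any library's API.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head."""
    d_head = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_head)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def multi_head_self_attention(tokens, num_heads):
    """Split each embedding into num_heads slices, attend per slice, concatenate.

    Assumption: no learned projections, so queries = keys = values = the slice.
    """
    d_model = len(tokens[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [t[h * d_head:(h + 1) * d_head] for t in tokens]
        head_outputs.append(attention_head(part, part, part))
    # Concatenate the per-head outputs for each token.
    return [
        [value for head in head_outputs for value in head[i]]
        for i in range(len(tokens))
    ]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
contextualized = multi_head_self_attention(tokens, num_heads=2)
```

Each output row has the same width as its input token but now depends on every token in the sequence, which is the sense in which attention contextualizes tokens.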
Over-smoothing in GNNs: as more message-passing layers are stacked, node features converge toward one another and nodes lose their distinguishing information
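Over-smoothing can be reproduced on a tiny graph. The sketch below assumes the simplest possible layer, mean aggregation over each node's neighbors (with self-loops) and no learned weights or nonlinearity; the four-node path graph and the helper names are hypothetical. It tracks how the spread between the largest and smallest node feature shrinks as layers are stacked.

```python
def mean_aggregate(features, neighbors):
    """One message-passing layer: each node takes the mean of its neighborhood."""
    return [
        sum(features[j] for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(features))
    ]


def feature_spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Path graph 0 - 1 - 2 - 3, with a self-loop on every node.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
features = [0.0, 1.0, 2.0, 3.0]

spreads = [feature_spread(features)]
for _ in range(50):
    features = mean_aggregate(features, neighbors)
    spreads.append(feature_spread(features))

print(f"spread at layer 0:  {spreads[0]:.4f}")
print(f"spread at layer 1:  {spreads[1]:.4f}")
print(f"spread at layer 50: {spreads[-1]:.2e}")
```

After enough layers every node carries nearly the same value, so the features no longer tell the nodes apart.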
In diffusion models, the reverse process learns p_θ(x_{t-1}|x_t), removing noise one step at a time
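One common way to realize p_θ(x_{t-1}|x_t) is the DDPM-style parameterization, in which a network predicts the noise in x_t. The sketch below assumes that parameterization, a linear β schedule, and one-dimensional data concentrated at a single value `target`. For that toy case the ideal noise prediction has a closed form, so `predict_noise` is a hypothetical stand-in for a trained network rather than a learned model.

```python
import math
import random


def reverse_sample(num_steps=50, target=2.0, seed=0):
    """Run the reverse chain x_T -> x_0 for toy data concentrated at `target`."""
    rng = random.Random(seed)
    # Assumed linear noise schedule.
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    def predict_noise(x_t, t):
        # Stand-in for a trained noise predictor: exact for point-mass data,
        # since x_t = sqrt(alpha_bar_t) * target + sqrt(1 - alpha_bar_t) * eps.
        return (x_t - math.sqrt(alpha_bars[t]) * target) / math.sqrt(1.0 - alpha_bars[t])

    x = rng.gauss(0.0, 1.0)  # start from (approximately) pure noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t)
        # Mean of p_theta(x_{t-1} | x_t) under the noise-prediction parameterization.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * noise
    return x


sample = reverse_sample()
print(f"final sample: {sample:.6f}")
```

With the exact noise prediction the final step lands on `target`; with a real trained network the same loop would instead produce samples from the learned data distribution.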
Adam bias correction: the moment estimates are initialized at zero and are therefore biased toward zero, so each is divided by (1-β^t); the correction matters most in early steps and fades as β^t approaches 0
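The size of the correction is easiest to see with a constant gradient. This minimal sketch tracks Adam's first and second moment estimates with and without the (1-β^t) division; the function name `adam_moments` and the default β values 0.9 and 0.999 are assumptions for illustration, and the parameter update itself is left out.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (m, m_hat, v, v_hat) per step for a sequence of scalar gradients."""
    m = 0.0  # first moment, initialized at zero
    v = 0.0  # second moment, initialized at zero
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        history.append((m, m_hat, v, v_hat))
    return history


history = adam_moments([2.0] * 5)
for step, (m, m_hat, v, v_hat) in enumerate(history, start=1):
    print(f"t={step}: m={m:.4f} m_hat={m_hat:.4f} v={v:.6f} v_hat={v_hat:.4f}")
```

With a constant gradient of 2.0, the raw first moment at step 1 is only 0.2, while the corrected estimate recovers 2.0, and the corrected second moment recovers 4.0.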
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
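For the L1 norm the proximity operator is soft-thresholding, which gives the ISTA iteration: a gradient step on the smooth part followed by a shrink toward zero. The sketch below applies it to a deliberately simple objective, 0.5 * ||w - a||² + λ * ||w||₁, whose known minimizer is the soft-thresholded `a`. The objective, step size, and function names are assumptions chosen so the result is easy to check.

```python
def soft_threshold(value, threshold):
    """Proximity operator of threshold * |.|: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient_l1(a, lam, step=0.5, iterations=100):
    """Minimize 0.5 * ||w - a||^2 + lam * ||w||_1 by proximal gradient descent."""
    w = [0.0] * len(a)
    for _ in range(iterations):
        # Gradient of the smooth term 0.5 * ||w - a||^2 is (w - a).
        gradient = [wi - ai for wi, ai in zip(w, a)]
        # Gradient step, then the prox of step * lam * ||.||_1.
        w = [
            soft_threshold(wi - step * gi, step * lam)
            for wi, gi in zip(w, gradient)
        ]
    return w


a = [3.0, 0.2, -1.5]
solution = proximal_gradient_l1(a, lam=0.5)
print(solution)
```

The small coefficient 0.2 is set exactly to zero rather than merely made small, which is the sparsity-inducing behavior that makes the proximal step useful for L1 regularization.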