Parallel transport moves vectors along a curve while preserving their lengths and the angles between them (for a metric-compatible connection such as the Levi-Civita connection)
Image: BYD Colombia, CC BY-SA 2.5, via Wikimedia Commons
Vanishing gradient problem
Residual connections help: the skip path adds an identity term to each layer's Jacobian, so gradients can reach early layers without being scaled down by every layer in between
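A minimal sketch of the effect, assuming a toy chain of scalar tanh layers with a shared weight (the function name, the weight 0.5, and the depth 20 are illustrative choices, not taken from any particular model). It multiplies the per-layer derivatives to get the gradient of the last activation with respect to the input, with and without a skip connection:

```python
import math

def input_gradient(depth, w=0.5, x0=0.3, residual=False):
    """Return d h_depth / d h_0 for a chain of scalar tanh layers."""
    h, grad = x0, 1.0
    for _ in range(depth):
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)   # derivative of tanh(w * h) with respect to h
        if residual:
            grad *= 1.0 + local     # the skip path contributes the constant 1
            h = h + t
        else:
            grad *= local           # each factor is at most w, so the product shrinks
            h = t
    return grad

plain = input_gradient(20)
skip = input_gradient(20, residual=True)
print(f"plain chain: {plain:.3e}, residual chain: {skip:.3e}")
```

In the plain chain every factor is at most 0.5, so the product decays geometrically with depth. In the residual chain every factor is at least 1, so the gradient cannot vanish in this toy setting.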
Pre-LN transformers are easier to train
Pre-LN transformers apply layer normalization inside each residual branch and leave the skip path as an unnormalized identity, allowing gradients to flow more smoothly during backpropagation
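The difference in where the normalization sits can be sketched in a few lines. This is an illustrative toy, assuming plain Python lists as activations, a layer norm without learned scale or bias, and a hypothetical `toy_sublayer` standing in for attention or the feed-forward network:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the branch input; the skip path carries x unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Post-LN: normalize after the addition, so the skip path is normalized too.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(x):
    # Hypothetical stand-in for attention or an MLP.
    return [0.1 * v for v in x]

x = [1.0, 2.0, 4.0, 8.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print(pre)
print(post)
```

In the pre-LN block the output is the input plus a correction, so the derivative of the output with respect to the input always contains an identity term. In the post-LN block that path passes through the normalization at every layer.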
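Most transformer operations are memory-bound, not compute-bound
Most transformer operations are memory-bound because they perform few arithmetic operations per byte moved: elementwise operations, softmax, and layer normalization spend more time reading and writing memory than computing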
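One way to make this concrete is arithmetic intensity, the number of floating-point operations per byte moved to or from memory. The sketch below assumes 2-byte (fp16) elements, ideal caching so each operand is moved once, and a hypothetical accelerator whose compute-to-bandwidth ratio is 100 FLOPs per byte; all function names and numbers are illustrative:

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each operand moved once."""
    flops = 2 * m * k * n                                  # one multiply and one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

def elementwise_add_intensity(num_elems, bytes_per_elem=2):
    """FLOPs per byte for c = a + b over num_elems elements."""
    flops = num_elems
    bytes_moved = 3 * num_elems * bytes_per_elem            # read a and b, write c
    return flops / bytes_moved

MACHINE_BALANCE = 100.0  # hypothetical device: FLOPs it can do per byte it can move

big = matmul_intensity(4096, 4096, 4096)   # large square matmul
decode = matmul_intensity(1, 4096, 4096)   # single-token matrix-vector product
add = elementwise_add_intensity(4096 * 4096)

for name, value in [("big matmul", big), ("decode matmul", decode), ("add", add)]:
    kind = "compute-bound" if value > MACHINE_BALANCE else "memory-bound"
    print(f"{name}: {value:.2f} FLOPs/byte -> {kind}")
```

Under these assumptions only the large matmul exceeds the machine balance. The elementwise add and the single-token matmul both fall far below it, so their runtime is set by memory bandwidth.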
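Message passing in GNNs: each node aggregates features from its neighbors
In each message-passing layer, a node collects its neighbors' features, combines them with a permutation-invariant function such as sum or mean, and uses the result to update its own representation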
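A minimal sketch of one message-passing round, assuming an undirected graph stored as an adjacency dictionary in which every node has at least one neighbor. Mean aggregation and the equal-weight update rule are illustrative choices; real GNN layers typically use learned weights here:

```python
def mean_aggregate(neighbors, features):
    """One round of mean-aggregation message passing."""
    updated = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        # Message and aggregate: average the neighbors' feature vectors.
        agg = [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        # Update (hypothetical rule): blend the node's own feature with the aggregate.
        updated[node] = [0.5 * s + 0.5 * a for s, a in zip(features[node], agg)]
    return updated

# Path graph 0 - 1 - 2 with one-dimensional features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [2.0], 2: [6.0]}
out = mean_aggregate(neighbors, features)
print(out)  # {0: [1.5], 1: [2.75], 2: [4.0]}
```

Because the mean does not depend on the order of the neighbors, relabeling the nodes does not change the result. Stacking several rounds lets information travel several hops.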
SGD with momentum escapes local minima better than vanilla SGD
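SGD with momentum keeps a velocity, a decaying sum of past gradients, so an update can carry through shallow local minima and flat regions where the current gradient alone is close to zero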
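The update rule can be sketched as follows. This uses the classical (heavy-ball) form of momentum, and the gradient sequence is a made-up illustration of sliding down a slope and then reaching a flat stretch; the learning rate 0.1 and momentum coefficient 0.9 are assumptions:

```python
def sgd_step(x, grad, lr):
    """Vanilla SGD: move against the current gradient only."""
    return x - lr * grad

def momentum_step(x, v, grad, lr, beta=0.9):
    """Heavy-ball momentum: the velocity v is a decaying sum of past steps."""
    v = beta * v - lr * grad
    return x + v, v

# Three steps on a slope (gradient 1.0), then two on a flat stretch (gradient 0.0).
grads = [1.0, 1.0, 1.0, 0.0, 0.0]
lr = 0.1

x_plain = 0.0
x_mom, v = 0.0, 0.0
for g in grads:
    x_plain = sgd_step(x_plain, g, lr)
    x_mom, v = momentum_step(x_mom, v, g, lr)

print(x_plain, x_mom)
```

Vanilla SGD stops moving as soon as the gradient is zero, while the momentum iterate keeps travelling on the velocity it built up. That is the mechanism behind passing through shallow dips, though it does not guarantee an escape in general.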
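Non-convex loss landscapes are hard: many local minima and saddle points
Non-convex loss landscapes are hard to optimize because gradient-based methods can stall at local minima and at saddle points, where the gradient vanishes even though the point is not a minimum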
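A small worked example of a saddle point, using the textbook function f(x, y) = x^2 - y^2 (chosen here for illustration). The gradient is zero at the origin, but the function curves up along x and down along y. Plain gradient descent started exactly on the x-axis converges to the saddle, while a tiny perturbation in y eventually escapes:

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y

def gradient_descent(x, y, lr=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Start exactly on the x-axis: y stays 0 and x shrinks toward the saddle at (0, 0).
x_stuck, y_stuck = gradient_descent(1.0, 0.0)

# Start with a tiny y offset: the y-coordinate grows by a factor of 1.2 per step.
x_free, y_free = gradient_descent(1.0, 1e-6)

print(x_stuck, y_stuck)
print(x_free, y_free)
```

Near the saddle the gradient is tiny in every direction, so progress is slow even when escape is possible. This slowdown is part of what makes saddle points costly in high-dimensional problems.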