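What parallel transport does

Parallel transport moves a vector along a curve while keeping it as constant as the geometry allows: its covariant derivative along the curve is zero. Under the Levi-Civita connection, this preserves the vector's length and the angles between transported vectors.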
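
A minimal sketch of the idea on the unit sphere, assuming a simple numerical scheme: step along the curve, project the vector onto the new tangent plane, and restore its length. The function names and the choice of curve (a circle of latitude) are illustrative, not from the card.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_to_tangent(v, p):
    """Drop the component of v along p (on the unit sphere, p is the normal)."""
    d = dot(v, p)
    return [x - d * y for x, y in zip(v, p)]

def rescale(v, length):
    n = math.sqrt(dot(v, v))
    return [x * length / n for x in v]

def transport_around_latitude(v0, colatitude, steps=20000):
    """Approximate parallel transport of v0 once around a circle of latitude."""
    length = math.sqrt(dot(v0, v0))
    v = list(v0)
    for k in range(1, steps + 1):
        phi = 2.0 * math.pi * k / steps
        p = [math.sin(colatitude) * math.cos(phi),
             math.sin(colatitude) * math.sin(phi),
             math.cos(colatitude)]
        # Keep the vector tangent at the new point without changing its length.
        v = rescale(project_to_tangent(v, p), length)
    return v

v_start = [0.0, 1.0, 0.0]  # tangent at the start point, pointing east
v_end = transport_around_latitude(v_start, math.radians(60.0))
print(v_end)
```

The length is preserved throughout, yet the vector does not return to its starting direction. At 60 degrees from the pole the loop encloses a solid angle of pi, so the vector comes back approximately reversed.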

Related concepts

Vanishing gradient problem

Residual connections help because the skip connection adds an identity path, so the gradient can reach earlier layers without being multiplied by every layer's derivative
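
A toy illustration, assuming a chain of scalar tanh layers with a shared weight (a hypothetical setup chosen so the chain rule can be written out by hand). It compares the gradient of the output with respect to the input, with and without skip connections.

```python
import math

def input_gradient(depth, weight, x0, residual):
    """Return d(output)/d(input) for a chain of scalar tanh layers."""
    x, grad = x0, 1.0
    for _ in range(depth):
        t = math.tanh(weight * x)
        local = weight * (1.0 - t * t)  # derivative of tanh(weight * x) w.r.t. x
        if residual:
            x = x + t                   # y = x + f(x)
            grad *= 1.0 + local         # the "1" is contributed by the skip path
        else:
            x = t                       # y = f(x)
            grad *= local
    return grad

plain_grad = input_gradient(depth=50, weight=0.5, x0=1.0, residual=False)
residual_grad = input_gradient(depth=50, weight=0.5, x0=1.0, residual=True)
print(plain_grad, residual_grad)
```

In the plain chain every factor is at most 0.5, so the product shrinks toward zero with depth. With the skip path every factor is at least 1, so the gradient cannot vanish in this toy setting.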

Pre-LN transformers are easier to train

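Pre-LN transformers apply layer normalization inside each residual branch, leaving the skip path as an unnormalized identity. Gradients flow through that path more smoothly during backpropagation, so training is often stable with less reliance on learning-rate warmup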
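
The difference between the two layouts can be sketched as follows. This is a simplified illustration: the layer norm has no learnable gain or bias, and `zero_sublayer` is a hypothetical stand-in for an attention or MLP sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: x + f(LN(x)); the skip path carries x through untouched.
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

def post_ln_block(x, sublayer):
    # Post-LN: LN(x + f(x)); the skip path itself passes through the norm.
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])

def zero_sublayer(x):
    # Hypothetical sublayer that contributes nothing, to isolate the skip path.
    return [0.0 for _ in x]

x = [1.0, 2.0, 4.0]
pre_out = pre_ln_block(x, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer)
print(pre_out, post_out)
```

With the sublayer silenced, the Pre-LN block is an exact identity, while the Post-LN block still rescales its input. Stacking many Post-LN blocks therefore puts a normalization on every step of the gradient's path.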

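Most transformer operations are memory-bound, not compute-bound

Most transformer operations are memory-bound because they perform few arithmetic operations per byte moved: elementwise operations, softmax, and normalization spend more time reading and writing memory than computing, and during decoding the weights of a large model must be streamed from memory for each generated token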
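
A back-of-the-envelope sketch of this reasoning using arithmetic intensity (FLOPs per byte moved). The hardware figures are hypothetical round numbers, not any specific accelerator, and the byte counts assume 2-byte elements with each operand read or written exactly once.

```python
BYTES_PER_ELEMENT = 2  # assumption: fp16 or bf16 storage

def elementwise_intensity(n):
    flops = n                                # one operation per element
    bytes_moved = 2 * n * BYTES_PER_ELEMENT  # read n inputs, write n outputs
    return flops / bytes_moved

def matmul_intensity(m, k, n):
    flops = 2 * m * k * n                    # one multiply and one add per term
    bytes_moved = (m * k + k * n + m * n) * BYTES_PER_ELEMENT
    return flops / bytes_moved

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12
ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to saturate compute

def is_memory_bound(intensity):
    return intensity < ridge

activation = elementwise_intensity(4096)         # e.g. an activation function
decode_step = matmul_intensity(1, 4096, 4096)    # one token times a weight matrix
large_gemm = matmul_intensity(4096, 4096, 4096)  # large batched matmul

for name, value in [("activation", activation),
                    ("decode_step", decode_step),
                    ("large_gemm", large_gemm)]:
    print(name, round(value, 2), "memory-bound" if is_memory_bound(value) else "compute-bound")
```

Under these assumptions only the large matrix multiplication clears the ridge point. The elementwise operation and the single-token decode step both sit far below it.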

What message passing does in GNNs

In message passing, each node aggregates features from its neighbors (for example by sum, mean, or max) and uses the result to update its own representation
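
One round of message passing can be sketched in plain Python. The mean aggregator and the "add to own features" update are illustrative choices; real GNN layers typically apply learned transformations as well.

```python
def message_passing_step(features, neighbors):
    """Run one synchronous round: aggregate neighbor features, then update."""
    updated = {}
    for node, own in features.items():
        messages = [features[n] for n in neighbors[node]]
        if messages:
            # Mean of the neighbors' feature vectors, one dimension at a time.
            aggregated = [sum(vals) / len(messages) for vals in zip(*messages)]
        else:
            aggregated = [0.0] * len(own)
        # Combine the node's own features with the aggregated message.
        updated[node] = [a + b for a, b in zip(own, aggregated)]
    return updated

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
new_features = message_passing_step(features, neighbors)
print(new_features)
```

All nodes read from the old `features` dictionary, so the update order does not matter. Repeating the step lets information travel one hop further each round.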

SGD with momentum escapes local minima better than vanilla SGD

SGD with momentum accumulates a velocity from past gradients, which can carry it through shallow local minima and flat regions where vanilla SGD stalls
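
A toy illustration on a hand-built one-dimensional landscape (entirely hypothetical): a downhill slope, a small bump that creates a shallow local minimum at x = 1, and a deeper bowl centered at x = 3. The updates are deterministic, so this shows the effect of the velocity term only, not of gradient noise.

```python
def toy_grad(x):
    if x < 1.0:
        return -1.0       # downhill slope leading to the shallow minimum
    if x < 1.2:
        return 1.0        # small uphill bump (barrier)
    return x - 3.0        # quadratic bowl with its minimum at x = 3

def toy_loss(x):
    if x < 1.0:
        return -x
    if x < 1.2:
        return x - 2.0
    return 0.5 * (x - 3.0) ** 2 - 2.42

def run_sgd(x, lr, beta, steps):
    """Gradient descent with heavy-ball momentum; beta = 0 gives the vanilla update."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * toy_grad(x)
        x = x + v
    return x

x_vanilla = run_sgd(0.0, lr=0.05, beta=0.0, steps=200)
x_momentum = run_sgd(0.0, lr=0.05, beta=0.9, steps=200)
print(x_vanilla, toy_loss(x_vanilla))
print(x_momentum, toy_loss(x_momentum))
```

Without momentum the iterate bounces around the shallow minimum near x = 1. With beta = 0.9 the accumulated velocity carries it over the bump and it settles in the deeper bowl. Whether this happens in general depends on the landscape, learning rate, and momentum coefficient.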

Why non-convex loss landscapes are hard

Non-convex loss landscapes are hard to optimize because they contain many local minima and saddle points, where the gradient vanishes even though the point is not a global minimum
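
A small example of both features, using the hypothetical function f(x, y) = x^2 + (y^2 - 1)^2. It has two minima at (0, 1) and (0, -1) and a saddle point at (0, 0), where the gradient is zero but the loss is not minimal.

```python
def saddle_loss(x, y):
    return x ** 2 + (y ** 2 - 1.0) ** 2

def saddle_grad(x, y):
    return 2.0 * x, 4.0 * y * (y ** 2 - 1.0)

def descend(x, y, lr=0.1, steps=200):
    """Plain gradient descent from the starting point (x, y)."""
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

stuck = descend(1.0, 0.0)      # starts exactly on the line y = 0
escaped = descend(1.0, 1e-3)   # same start with a tiny nudge in y
print(stuck, saddle_loss(*stuck))
print(escaped, saddle_loss(*escaped))
```

Started exactly on the line y = 0, gradient descent converges to the saddle and stops there, because the gradient gives no signal in the y direction. A tiny perturbation is enough to escape toward one of the minima, which is one reason noise in the updates can help.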
