SGD with momentum escapes local minima better than vanilla SGD

SGD with momentum accumulates a velocity from past gradients, which can carry the iterate through shallow local minima and flat regions faster than vanilla SGD

Related concepts

non-convex loss landscapes are hard: many local minima and saddle points

Non-convex loss landscapes are hard to optimize because they contain many local minima and saddle points where gradient-based methods can stall

saddle points are more common than local minima in high dimensions

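Saddle points are critical points (zero gradient) where the Hessian has both positive and negative eigenvalues, so the loss curves up in some directions and down in others. In high dimensions it is far more likely that curvature has mixed signs than that it is positive in every direction, which is why saddle points outnumber local minima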
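
A small worked example may make the curvature condition concrete. The function below is chosen only for illustration, and the final counting argument rests on an idealized assumption (each Hessian eigenvalue sign independent and equally likely), so it should be read as a heuristic rather than a theorem.

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f(x, y) = \begin{pmatrix} 2x \\ -2y \end{pmatrix} = 0
\;\Longleftrightarrow\; (x, y) = (0, 0)

H = \nabla^2 f = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix},
\qquad \lambda_1 = 2 > 0, \quad \lambda_2 = -2 < 0
\;\Longrightarrow\; (0, 0) \text{ is a saddle point}

% Heuristic, under the independence assumption stated above:
P(\text{all } n \text{ eigenvalues} > 0) = 2^{-n}
```

Under that assumption, a critical point in n dimensions is a local minimum only with probability 2^{-n}, which shrinks rapidly as n grows.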

batch size affects generalization: larger batches find sharper minima

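Larger batch sizes give more accurate, lower-noise gradient estimates, and that reduced noise tends to let training settle into sharper minima, which often generalize worse than the flatter minima found with smaller batches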
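
The link between batch size and gradient accuracy can be written down directly. The sketch below assumes the B per-example gradients g_i in a minibatch are drawn i.i.d. with mean ∇L and covariance Σ; the symbols g_i, B, and Σ are introduced here for illustration.

```latex
\hat{g}_B = \frac{1}{B} \sum_{i=1}^{B} g_i,
\qquad
\mathbb{E}[\hat{g}_B] = \nabla L,
\qquad
\operatorname{Cov}[\hat{g}_B] = \frac{\Sigma}{B}
```

So doubling the batch size halves the variance of the gradient estimate, which is the sense in which small batches inject more noise into each update.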

Vanishing gradient problem

Gradients shrink as they are multiplied through many layers. Residual connections help by letting the gradient flow directly through the skip connection
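
A minimal sketch of why the skip connection helps, using scalar "layers" as a deliberately simplified assumption: in a plain stack the end-to-end gradient is the product of each layer's local derivative, while a residual block y = x + F(x) contributes 1 + F'(x) instead. The function names and the value 0.1 are hypothetical, chosen only for illustration.

```python
def chain_gradient(local_derivs):
    # Plain stack: the end-to-end derivative is the product of local derivatives.
    g = 1.0
    for d in local_derivs:
        g *= d
    return g


def residual_chain_gradient(local_derivs):
    # Residual stack: each block is x + F(x), so its derivative is 1 + F'(x).
    g = 1.0
    for d in local_derivs:
        g *= 1.0 + d
    return g


# 20 layers, each with a small local derivative F'(x) = 0.1 (assumed value).
derivs = [0.1] * 20
plain = chain_gradient(derivs)
residual = residual_chain_gradient(derivs)
print(plain, residual)
```

With these assumed values the plain product collapses toward zero, while the residual product stays above 1 because every factor includes the identity term.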

Convex optimization

For a convex function, every local minimum is a global minimum; a strictly convex function has at most one minimizer
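
As a rough illustration, gradient descent on a strictly convex function should reach the same minimizer from any starting point. The function f(x) = (x - 3)^2, the learning rate, and the starting points below are assumptions made for this sketch.

```python
def grad_f(x):
    # Gradient of the assumed convex example f(x) = (x - 3)**2.
    return 2.0 * (x - 3.0)


def gradient_descent(x, lr=0.1, steps=200):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


# Very different starting points end up at the same minimizer, x = 3.
results = [gradient_descent(x0) for x0 in (-10.0, 0.0, 25.0)]
print(results)
```

On a non-convex function the same experiment could end in different local minima depending on the start.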

what the momentum term does: v_t = βv_{t-1} + ∇L accumulates gradient direction

The momentum term accelerates convergence along directions where successive gradients agree and damps oscillation where they keep changing sign
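
The update v_t = βv_{t-1} + ∇L can be sketched in a few lines. This is a toy comparison, not a reference implementation: the loss L(θ) = 0.5 θ², the parameter step θ ← θ − lr · v_t, and the values lr = 0.01 and β = 0.9 are all assumptions made for illustration.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = 0.5 * theta**2.
    return theta


def run_sgd(theta, lr, steps):
    # Vanilla gradient descent: step directly along the current gradient.
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta


def run_momentum(theta, lr, beta, steps):
    # Momentum: v_t = beta * v_{t-1} + grad L, then step along v_t.
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta -= lr * v
    return theta


theta_plain = run_sgd(1.0, lr=0.01, steps=100)
theta_mom = run_momentum(1.0, lr=0.01, beta=0.9, steps=100)
print(abs(theta_plain), abs(theta_mom))
```

Because the gradient keeps pointing the same way early on, the velocity builds up and the momentum run ends much closer to the minimum at 0 than the vanilla run after the same number of steps.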
