
SGD with momentum adds velocity to escape shallow local minima faster
Image: Gnsin, CC BY-SA 3.0, via Wikimedia Commons
Non-convex loss landscapes are hard: many local minima and saddle points
Non-convex loss landscapes are hard to optimize because gradient-based methods can stall near saddle points or settle into a local minimum rather than the global one
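Saddle points are more common than local minima in high dimensions
A saddle point is a critical point (zero gradient) whose Hessian has both positive and negative eigenvalues, so the loss curves up in some directions and down in others. A local minimum needs every eigenvalue to be non-negative, which becomes less likely as the number of dimensions grows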
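The eigenvalue test can be sketched for the two-dimensional case. This is a minimal illustration, not a general tool: the helper names are hypothetical, the Hessian is assumed to be the symmetric 2x2 matrix [[a, b], [b, c]], and the three example functions are chosen only because each has zero gradient at the origin.

```python
import math

def hessian_eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric Hessian [[a, b], [b, c]] in closed form."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c, tol=1e-12):
    """Second-order test at a point where the gradient is already zero."""
    low, high = hessian_eigenvalues_2x2(a, b, c)
    if low > tol:
        return "local minimum"      # all eigenvalues positive
    if high < -tol:
        return "local maximum"      # all eigenvalues negative
    if low < -tol and high > tol:
        return "saddle point"       # mixed signs
    return "inconclusive"           # some eigenvalue is (numerically) zero

# Hessians at the origin for three example functions:
print(classify_critical_point(2.0, 0.0, 2.0))   # f = x**2 + y**2
print(classify_critical_point(2.0, 0.0, -2.0))  # f = x**2 - y**2
print(classify_critical_point(0.0, 1.0, 0.0))   # f = x * y
```

Note that f = x * y is a saddle even though its mixed partial derivative is non-zero: what matters is the sign pattern of the eigenvalues.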
Batch size affects generalization: larger batches find sharper minima
Larger batch sizes give more accurate, less noisy gradient estimates, but they tend to converge to sharper minima, which often generalize worse than the flatter minima reached with small batches
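The link between batch size and gradient noise can be checked with a small simulation. This sketch covers only the noise part of the claim, not the sharpness of the minimum reached. The per-example loss 0.5 * (w - x)**2, the standard-normal data, and the function names are all assumptions made for illustration.

```python
import random
import statistics

def minibatch_gradient(w, batch):
    # Per-example loss 0.5 * (w - x)**2 has gradient (w - x); average over the batch.
    return sum(w - x for x in batch) / len(batch)

def gradient_noise(batch_size, trials=300, seed=0):
    """Standard deviation of the mini-batch gradient at w = 0 across random batches."""
    rng = random.Random(seed)
    grads = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        grads.append(minibatch_gradient(0.0, batch))
    return statistics.stdev(grads)

small_batch_noise = gradient_noise(batch_size=4)
large_batch_noise = gradient_noise(batch_size=256)
print(small_batch_noise, large_batch_noise)
```

The noise should shrink roughly with the square root of the batch size, so the larger batch gives a much steadier gradient estimate.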
Vanishing gradient problem
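Residual connections help: each block computes x + F(x), so the backward pass always has an identity path and gradients can reach early layers without being repeatedly multiplied by small layer derivatives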
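A scalar toy model shows the effect of the skip connection on the chain-rule product. It is a sketch under strong assumptions: every layer is a scalar function with the same small derivative (0.1 here, a hypothetical value), and the function name is made up for this example.

```python
def gradient_at_input(local_derivatives, residual):
    """Chain-rule product of per-layer derivatives for a stack of scalar layers."""
    grad = 1.0
    for d in local_derivatives:
        # Plain layer    y = F(x):     dy/dx = F'(x)
        # Residual layer y = x + F(x): dy/dx = 1 + F'(x)
        grad *= (1.0 + d) if residual else d
    return grad

layer_derivatives = [0.1] * 20  # hypothetical small per-layer derivatives
plain_grad = gradient_at_input(layer_derivatives, residual=False)
residual_grad = gradient_at_input(layer_derivatives, residual=True)
print(plain_grad, residual_grad)
```

The plain stack multiplies twenty small numbers together and the gradient all but disappears, while the "+1" contributed by each skip connection keeps the product from collapsing toward zero.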
Convex optimization
A convex function has no spurious local minima: every local minimum is a global minimum. A strictly convex function has at most one minimizer
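Convexity means every chord lies on or above the function: f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y). The sketch below samples that inequality on a grid. It can only find violations, not prove convexity, and the three test functions and all names are assumptions chosen for illustration.

```python
def violates_convexity(f, xs, ts=(0.25, 0.5, 0.75), tol=1e-12):
    """Return True if some sampled chord lies strictly below the function."""
    for x in xs:
        for y in xs:
            for t in ts:
                point = t * x + (1 - t) * y
                if f(point) > t * f(x) + (1 - t) * f(y) + tol:
                    return True
    return False

def bowl(x):
    return x ** 2                    # strictly convex, single minimizer at 0

def flat_bottom(x):
    return max(abs(x) - 1.0, 0.0)    # convex, every x in [-1, 1] is a global minimizer

def double_well(x):
    return (x ** 2 - 1.0) ** 2       # non-convex, minima at -1 and +1

grid = [i / 4 for i in range(-12, 13)]  # sample points from -3 to 3
for f in (bowl, flat_bottom, double_well):
    print(f.__name__, violates_convexity(f, grid))
```

The flat-bottomed function is the reason for the "strictly convex" qualifier: it is convex but has a whole interval of global minimizers.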
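What the momentum term does: v_t = βv_{t-1} + ∇L accumulates the gradient direction
The velocity v_t is an exponentially weighted sum of past gradients, with β (typically around 0.9) controlling how quickly old gradients decay. Gradient components that keep the same sign build up and speed convergence, while components that oscillate largely cancel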
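The update v_t = βv_{t-1} + ∇L can be tried on a one-dimensional toy problem. This is a minimal sketch: the parameter step x = x - lr * v, the learning rate, and the shallow test loss L(x) = 0.005 * x**2 are assumptions added for the example. It shows faster progress on a shallow slope, not escape from a local minimum.

```python
def sgd_momentum(grad_fn, x0, lr, beta, steps):
    """Minimize a 1-D function with v = beta * v + grad, then x = x - lr * v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(x)  # accumulate the gradient direction
        x = x - lr * v             # step along the accumulated velocity
    return x

def shallow_bowl_grad(x):
    # Gradient of the hypothetical test loss L(x) = 0.005 * x**2
    return 0.01 * x

# beta = 0.0 reduces the update to plain gradient descent.
plain_x = sgd_momentum(shallow_bowl_grad, x0=1.0, lr=1.0, beta=0.0, steps=100)
momentum_x = sgd_momentum(shallow_bowl_grad, x0=1.0, lr=1.0, beta=0.9, steps=100)
print(abs(plain_x), abs(momentum_x))
```

With the same learning rate and step count, the momentum run should end much closer to the minimum at 0, because consecutive gradients point the same way and the velocity builds up.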