Momentum term accelerates convergence in the gradient direction
Image: Captain Galaxy, CC BY 4.0, via Wikimedia Commons
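The momentum term keeps an exponentially decaying sum of past gradients, so steps grow along directions where successive gradients agree and shrink where they oscillate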
SGD with momentum escapes local minima better than vanilla SGD
SGD with momentum carries a velocity built from past gradients, which can help it coast through shallow local minima where vanilla SGD stalls
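To make the velocity idea concrete, here is a minimal sketch of the heavy-ball momentum update on a one-dimensional quadratic. The function names `sgd_momentum` and `grad_quadratic`, the objective f(x) = (x - 3)^2, and the hyperparameter values are assumptions chosen for illustration, not taken from the card.

```python
def sgd_momentum(grad, x0, lr=0.1, beta=0.9, steps=300):
    """Minimize a 1-D function from its gradient using heavy-ball momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        # Velocity is a decaying sum of past gradient steps.
        v = beta * v - lr * grad(x)
        x = x + v
    return x


def grad_quadratic(x):
    # Gradient of the illustrative objective f(x) = (x - 3)^2.
    return 2.0 * (x - 3.0)


x_momentum = sgd_momentum(grad_quadratic, x0=0.0)
# Setting beta=0 removes the velocity memory and recovers vanilla gradient descent.
x_plain = sgd_momentum(grad_quadratic, x0=0.0, beta=0.0)
```

Both runs settle near the minimizer x = 3 on this convex toy problem; it has no local minima to escape, so it only shows the mechanics of the update.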
Gradient
Gradient points uphill in the direction of steepest increase of f
What gradient clipping does: caps gradient norm to prevent exploding gradients
Gradient clipping rescales the gradient whenever its norm exceeds a threshold, capping the norm to prevent exploding gradients
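A minimal sketch of clipping by L2 norm, assuming the gradient is a flat list of floats. The helper name `clip_by_norm` and the threshold `max_norm` are illustrative, not any particular library's API.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        # Small gradients pass through unchanged.
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left alone
```

Scaling the whole vector, rather than clamping each component, keeps the update direction intact and limits only the step length.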
AdaGrad's learning rate decays to zero
AdaGrad divides the learning rate by the square root of the accumulated squared gradients; because that sum only ever grows, the effective learning rate shrinks monotonically toward zero
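The decay is easy to see by tracking the effective step size for a single parameter. This is a sketch assuming a constant gradient of 1.0; `adagrad_effective_lrs`, `lr`, and `eps` are illustrative names and values.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the per-step effective learning rate lr / (sqrt(sum g^2) + eps)."""
    accum = 0.0
    rates = []
    for g in grads:
        accum += g * g  # the accumulator never decreases
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates


rates = adagrad_effective_lrs([1.0] * 100)
```

With a constant gradient the accumulator grows linearly in the step count t, so the effective rate falls roughly like 1/sqrt(t): about 0.1 at step 1 and about 0.01 at step 100.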
What Langevin dynamics does: adds noise to gradient descent to sample from a distribution
Langevin dynamics adds Gaussian noise to each gradient descent step, so the iterates sample from a target distribution instead of converging to a single point
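A minimal sketch of the unadjusted Langevin algorithm, assuming the target is a standard normal with energy U(x) = x^2 / 2, so the gradient of U is x. The names `langevin_samples` and `grad_u`, the step size, and the burn-in length are assumptions for illustration.

```python
import math
import random


def langevin_samples(grad_u, x0, step, n_steps, seed=0):
    """Iterate x <- x - step * grad_u(x) + sqrt(2 * step) * N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_u(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


def grad_u(x):
    # Gradient of U(x) = x^2 / 2, the negative log-density of N(0, 1) up to a constant.
    return x


samples = langevin_samples(grad_u, x0=0.0, step=0.05, n_steps=50000)
burned = samples[1000:]  # drop early iterates as burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

The sample mean and variance should land near 0 and 1. The finite step size introduces a small bias in the variance, which a Metropolis correction would remove.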
Adam has bias correction: divides by (1-β^t) in early steps
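Adam's bias correction divides each moment estimate by (1-β^t) to counteract the bias toward zero caused by initializing the moving averages at zero; the correction matters most in early steps and fades as β^t approaches 0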
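The effect of the correction shows up clearly on the first-moment estimate alone. The sketch below assumes a constant gradient of 1.0; `adam_first_moment` is an illustrative helper, not a full Adam optimizer.

```python
def adam_first_moment(grads, beta1=0.9):
    """Return (raw, bias-corrected) first-moment estimates for each step."""
    m = 0.0  # zero initialization is what biases early estimates toward zero
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        history.append((m, m_hat))
    return history


history = adam_first_moment([1.0] * 5)
```

At step 1 the raw estimate is only about 0.1 even though the true gradient is 1.0, while the corrected estimate is already 1.0. The same correction applies to the second-moment estimate with its own decay rate.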