Momentum term: v_t = βv_{t-1} + ∇L accumulates past gradients into a velocity

The momentum term accelerates convergence along directions where successive gradients agree, and damps oscillation where they alternate in sign
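
A minimal sketch of the update v_t = βv_{t-1} + ∇L on a toy problem. The function names, the quadratic loss f(w) = w², and the hyperparameter values are illustrative assumptions, not part of the card.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One momentum update: v_t = beta * v_{t-1} + grad, then w <- w - lr * v_t."""
    v = beta * v + grad
    w = w - lr * v
    return w, v


def minimize_quadratic(steps=50, lr=0.01, beta=0.9):
    """Run momentum SGD on the toy loss f(w) = w^2 (gradient 2w), starting at w = 1."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, 2.0 * w, lr, beta)
    return w


with_momentum = minimize_quadratic(beta=0.9)
without_momentum = minimize_quadratic(beta=0.0)  # beta = 0 reduces to vanilla gradient descent
```

With the same small learning rate, the run with β = 0.9 should end closer to the minimum at w = 0 than the run with β = 0, because the velocity keeps adding up gradients that point the same way.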

Related concepts

SGD with momentum can escape shallow local minima more easily than vanilla SGD

SGD with momentum keeps a velocity built from past gradients, which can carry the iterate through shallow local minima and flat regions where vanilla SGD would stall

Gradient

The gradient ∇f points in the direction of steepest increase of f, so gradient descent steps in the opposite direction

Gradient clipping: caps gradient norm to prevent exploding gradients

Gradient clipping rescales the gradient whenever its norm exceeds a threshold, which keeps exploding gradients from destabilizing training
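
A small sketch of clipping by global norm, assuming the gradient is a flat list of floats. The name clip_by_norm and the threshold max_norm are hypothetical, chosen for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; leave it unchanged otherwise."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm  # shrink every component by the same factor
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is already under the cap
```

Scaling all components by one factor preserves the gradient's direction; only its length changes.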

AdaGrad's learning rate decays to zero

AdaGrad scales the learning rate by the inverse square root of the accumulated squared gradients. Because that sum only grows, the effective learning rate shrinks monotonically toward zero
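
A sketch of how the effective step size evolves, assuming a scalar parameter and the common form lr / (sqrt(G_t) + eps), where G_t is the running sum of squared gradients. The function name and default values are illustrative assumptions.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate lr / (sqrt(G_t) + eps) after each gradient."""
    acc = 0.0  # running sum of squared gradients, G_t
    rates = []
    for g in grads:
        acc += g * g
        rates.append(lr / (math.sqrt(acc) + eps))
    return rates


# With a constant gradient of 1, G_t = t, so the rate falls off like lr / sqrt(t).
rates = adagrad_effective_lrs([1.0] * 100)
```

For bounded gradients the accumulator grows at most linearly in the step count, so the decay is polynomial; it never recovers, since nothing is ever subtracted from G_t.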

Langevin dynamics: adds noise to gradient descent to sample from a distribution

Langevin dynamics adds Gaussian noise to each gradient step on the log-density, so the iterates sample from a target distribution instead of converging to a single point
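
A sketch of the unadjusted Langevin update x ← x + ε∇log p(x) + sqrt(2ε)·ξ with ξ ~ N(0, 1), assuming a standard normal target so that ∇log p(x) = -x. The function name, step size, and seed are illustrative choices.

```python
import math
import random


def langevin_samples(n_steps=20000, step=0.1, seed=0):
    """Draw correlated samples from N(0, 1) with the unadjusted Langevin algorithm."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        grad_log_p = -x  # score of the standard normal: d/dx log p(x) = -x
        noise = math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        x = x + step * grad_log_p + noise
        samples.append(x)
    return samples


samples = langevin_samples()
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean should land near 0 and the variance near 1. Because no Metropolis correction is applied, the discretization adds a small bias that shrinks as the step size decreases.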

Adam bias correction: divides moment estimates by (1-β^t)

Adam's bias correction divides each moment estimate by (1-β^t). The moving averages start at zero, so they are biased toward zero in early steps; the correction matters most there and fades as β^t → 0
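
A sketch of the correction for the first-moment estimate only, assuming the usual exponential moving average m_t = β₁m_{t-1} + (1-β₁)g_t initialized at zero. The function name and β₁ = 0.9 are illustrative assumptions.

```python
def adam_first_moment(grads, beta1=0.9):
    """Return (raw, corrected) first-moment estimates for a sequence of scalar gradients."""
    m = 0.0  # moving average starts at zero, which is the source of the bias
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta1 ** t))  # bias-corrected estimate
    return raw, corrected


# With a constant gradient of 1, the raw estimate starts far below 1,
# while the corrected estimate recovers the true mean from the first step.
raw, corrected = adam_first_moment([1.0] * 5)
```

The same division is applied to the second-moment estimate with its own decay rate; it is omitted here to keep the sketch short.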
