AdaGrad adjusts learning rate by accumulating squared gradients, causing it to decay to zero as denominator grows exponentially

Image: Unknown authorUnknown author, CC BY 4.0, via Wikimedia Commons

AdaGrad's learning rate decays to zero

AdaGrad adjusts learning rate by accumulating squared gradients, causing it to decay to zero as denominator grows exponentially

Related concepts

AdaGrad does: divides learning rate by sqrt of sum of squared gradients

AdaGrad adjusts learning rate by dividing it by the square root of the sum of squared gradients

gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation reduces memory usage by dividing a large batch into smaller mini-batches, accumulating gradients before updating model weights

Adam has bias correction: divides by (1-β^t) in early steps

Adam bias correction divides by (1-β^t) in early steps to counteract initial bias from accumulated gradients

batch size affects generalization: larger batches find sharper minima

Larger batch sizes lead to sharper minima, enhancing generalization by providing more accurate gradient estimates

to standardize: when you need zero mean and unit variance for gradient-based optimization

Standardize when zero mean and unit variance are required for gradient-based optimization

Proximal gradient methods for learning

Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator

Swipe through 100 ML concepts daily