AdaGrad adapts the learning rate per parameter by dividing it by the square root of the running sum of that parameter's squared gradients
Image: National Oceanic and Atmospheric Administration, Public domain, via Wikimedia Commons
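AdaGrad's effective learning rate decays toward zero
AdaGrad accumulates squared gradients in a sum that never shrinks, so the denominator grows monotonically and the effective learning rate decays toward zero, roughly like 1/sqrt(t) when gradients stay similar in size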
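A minimal sketch of that decay in plain Python, assuming a single scalar parameter and a constant gradient of 1.0. The function name, the base rate `lr=0.1`, and the `eps` stabilizer are illustrative choices, not taken from any particular library.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective step size lr / (sqrt(sum of g^2) + eps) after each gradient."""
    accum = 0.0  # running sum of squared gradients; it can only grow
    rates = []
    for g in grads:
        accum += g * g
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates

# Assumed toy input: 100 identical gradients of 1.0.
rates = adagrad_effective_lrs([1.0] * 100)
```

With this input the sum after t steps is t, so the effective rate is close to 0.1 / sqrt(t): about 0.1 at the first step and about 0.01 at step 100.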
What RMSprop fixes about AdaGrad: it uses an exponential moving average of squared gradients instead of a sum
RMSprop uses an exponentially decaying average of squared gradients, unlike AdaGrad's cumulative sum
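A short sketch of the difference, assuming the same toy setup of a scalar parameter and constant gradients. The names and the defaults `decay=0.9` and `lr=0.1` are illustrative assumptions.

```python
import math

def rmsprop_effective_lrs(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Return lr / (sqrt(moving average of g^2) + eps) after each gradient."""
    avg = 0.0  # exponentially decaying average of squared gradients
    rates = []
    for g in grads:
        avg = decay * avg + (1 - decay) * g * g  # old gradients fade out
        rates.append(lr / (math.sqrt(avg) + eps))
    return rates

# Assumed toy input: 200 identical gradients of 1.0.
rates = rmsprop_effective_lrs([1.0] * 200)
```

Because old squared gradients are forgotten, the average settles near 1.0 here and the effective rate levels off near the base rate of 0.1 instead of shrinking toward zero.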
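Adam has bias correction: it divides each moment estimate by (1-β^t), which matters most in early steps
Adam initializes its moving averages of the gradient and squared gradient at zero, so early estimates are biased toward zero; dividing by (1-β^t) counteracts that bias, and the correction fades as t grows because β^t approaches 0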
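A minimal sketch of the correction, assuming a scalar parameter, a constant gradient of 2.0, and the commonly used defaults β1 = 0.9 and β2 = 0.999. The function name and the returned tuple layout are hypothetical, chosen for illustration.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (m, m_hat, v, v_hat) per step: raw and bias-corrected moment estimates."""
    m = 0.0  # first moment (moving average of gradients), starts at zero
    v = 0.0  # second moment (moving average of squared gradients), starts at zero
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        history.append((m, m_hat, v, v_hat))
    return history

# Assumed toy input: five identical gradients of 2.0.
history = adam_moments([2.0] * 5)
m1, m_hat1, v1, v_hat1 = history[0]
```

At the first step the raw estimates are far too small (m is about 0.2 and v about 0.004), while the corrected values recover the gradient and its square (about 2.0 and 4.0).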
Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation keeps peak memory at the level of one small mini-batch: a large batch is split into mini-batches, their gradients are accumulated over several forward and backward passes, and the model weights are updated once at the end
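A framework-free sketch of the idea, assuming a one-parameter linear model y = w * x with mean squared error. The data and the function names are made up for illustration. It checks that gradients accumulated over mini-batches match the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate mini-batch gradients so the total equals the full-batch gradient."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each mini-batch mean gradient by its share of the full batch.
        total += mse_grad(w, mx, my) * len(mx) / n
    return total  # a single weight update would use this value

# Assumed toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

Only one mini-batch is processed at a time, which is where the memory saving comes from. The price is more forward and backward passes per weight update.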
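What cosine annealing does: lr = lr_min + 0.5(lr_max - lr_min)(1 + cos(πt/T))
Cosine annealing lowers the learning rate from lr_max at step t = 0 to lr_min at step t = T along half a cosine curve; the schedule is cyclical only when it is restarted, as in warm restarts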
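The formula above translates directly into code. This is a small sketch, with T = 100, lr_max = 0.1 and lr_min = 0.001 picked as example values.

```python
import math

def cosine_annealing_lr(t, T, lr_max, lr_min=0.0):
    """Learning rate at step t of a T-step cosine annealing schedule."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# Assumed example values: 100 steps from 0.1 down to 0.001.
schedule = [cosine_annealing_lr(t, 100, 0.1, 0.001) for t in range(101)]
```

The schedule starts at lr_max, passes the midpoint (lr_max + lr_min) / 2 at t = T/2, and ends at lr_min, changing slowly near both ends and fastest in the middle.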
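When to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize features when they sit on different scales and the model is trained with gradient-based optimization, which tends to converge faster on inputs with zero mean and unit variance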
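A minimal sketch of standardization for a single feature, using the population standard deviation. The function name and the sample values are hypothetical.

```python
import math

def standardize(values):
    """Rescale a list of numbers to zero mean and unit variance (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]

# Assumed toy feature, e.g. heights in centimetres.
z = standardize([150.0, 160.0, 170.0, 180.0, 190.0])
```

In practice the mean and standard deviation would be computed on the training split only and reused for validation and test data, so that no information leaks from held-out samples.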