What RMSprop fixes about AdaGrad: it uses an exponential moving average instead of a sum

RMSprop uses an exponentially decaying average of squared gradients, unlike AdaGrad's cumulative sum
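The difference can be seen in a minimal sketch that tracks only the factor each method multiplies the base learning rate by. The function names, the decay value of 0.9, and the constant-gradient input are assumptions chosen for illustration, not part of the original card.

```python
import math

def adagrad_scale(grads, eps=1e-8):
    """Multiplier AdaGrad applies to the base learning rate at each step."""
    total = 0.0  # cumulative sum of squared gradients, never shrinks
    scales = []
    for g in grads:
        total += g * g
        scales.append(1.0 / (math.sqrt(total) + eps))
    return scales

def rmsprop_scale(grads, decay=0.9, eps=1e-8):
    """Multiplier RMSprop applies to the base learning rate at each step."""
    avg = 0.0  # exponentially decaying average of squared gradients
    scales = []
    for g in grads:
        avg = decay * avg + (1.0 - decay) * g * g
        scales.append(1.0 / (math.sqrt(avg) + eps))
    return scales

# Illustrative input: a gradient that stays at 1.0 for 1000 steps.
grads = [1.0] * 1000
ada = adagrad_scale(grads)
rms = rmsprop_scale(grads)
```

With this input the AdaGrad multiplier keeps shrinking as 1/sqrt(t), while the RMSprop multiplier settles near 1 because old squared gradients are forgotten.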

Related concepts

What AdaGrad does: divides the learning rate by the square root of the sum of squared gradients

AdaGrad scales each parameter's learning rate by dividing it by the square root of that parameter's accumulated sum of squared gradients

Why AdaGrad's learning rate decays to zero

AdaGrad accumulates squared gradients in the denominator; because that sum only ever grows, the effective learning rate shrinks monotonically toward zero

What the β₁ and β₂ hyperparameters control in Adam

In the Adam optimizer, β₁ controls the exponential decay rate of the first moment estimate (the moving average of gradients); β₂ controls the exponential decay rate of the second moment estimate (the moving average of squared gradients)
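A minimal single-parameter sketch of one Adam update shows where each hyperparameter enters. The function name `adam_step` and the example values are assumptions for illustration; the defaults of 0.9 and 0.999 are the commonly used ones.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Illustrative first step: parameter 5.0, gradient 10.0, both moments start at zero.
x1, m1, v1 = adam_step(5.0, 10.0, 0.0, 0.0, t=1, lr=0.01)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.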

When to standardize: when you need zero mean and unit variance for gradient-based optimization

Standardize features to zero mean and unit variance when training with gradient-based optimization, so that features on very different scales do not dominate the updates
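As a small sketch of the transformation itself, assuming a single feature held in a plain list (the function name and sample values are made up for illustration):

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical feature values on a large scale.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
```

In practice the mean and standard deviation would be computed on the training split only and reused for validation and test data.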

What mixed precision training does: forward in FP16, accumulate gradients in FP32

Mixed precision training runs the forward pass in FP16 and accumulates gradients in FP32, so that small values are not lost to FP16 rounding
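The reason for the higher-precision accumulator can be illustrated with a rough sketch that simulates FP16 rounding using the standard library's half-precision `struct` format. This is not a training loop: the helper `to_fp16` is hypothetical, and Python's 64-bit float stands in for FP32 here.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

acc16 = to_fp16(1.0)  # accumulator kept in simulated FP16
acc_hi = 1.0          # higher-precision accumulator (stand-in for FP32)
update = 1e-4         # illustrative small contribution

for _ in range(1000):
    # Near 1.0, FP16 values are about 0.001 apart, so this sum rounds back to 1.0.
    acc16 = to_fp16(acc16 + to_fp16(update))
    acc_hi = acc_hi + update
```

The FP16 accumulator never moves, while the higher-precision one ends near 1.1, which is the kind of loss that FP32 accumulation is meant to avoid.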

Why second-order methods (Newton's) converge faster but are expensive: O(n³) per step

Second-order methods such as Newton's method converge quadratically near the optimum, but each iteration costs O(n³) because it solves a linear system with the n × n Hessian
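A small sketch makes both halves of that trade-off concrete. It assumes a two-variable quadratic objective with made-up coefficients `H` and `c`, and uses a hand-written Gaussian elimination whose nested loops are the source of the cubic cost.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (O(n^3))."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (M[row][n] - s) / M[row][row]
    return x

# Illustrative objective f(x) = 0.5 * x^T H x - c^T x, whose Hessian is H.
H = [[4.0, 1.0], [1.0, 3.0]]
c = [1.0, 2.0]

def gradient(x):
    return [sum(H[i][j] * x[j] for j in range(2)) - c[i] for i in range(2)]

x = [10.0, -10.0]
step = solve(H, gradient(x))              # Newton direction: H^{-1} * gradient
x_newton = [xi - si for xi, si in zip(x, step)]
```

Because the objective is exactly quadratic, a single Newton step lands on the minimizer; on general functions the fast convergence holds only near the optimum, and the linear solve is repeated at every iteration.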
