
RMSprop uses an exponentially decaying average of squared gradients, unlike AdaGrad's cumulative sum
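A minimal sketch of the two accumulators, to make the contrast concrete. The function names, the decay value of 0.9, and the constant gradient of 1.0 are illustrative assumptions, not taken from any particular library.

```python
def adagrad_accumulate(acc, grad):
    # AdaGrad: cumulative sum of squared gradients (never shrinks).
    return acc + grad ** 2


def rmsprop_accumulate(acc, grad, decay=0.9):
    # RMSprop: exponentially decaying average of squared gradients.
    # decay=0.9 is an assumed, commonly used value.
    return decay * acc + (1.0 - decay) * grad ** 2


adagrad_acc = 0.0
rmsprop_acc = 0.0
for _ in range(100):
    grad = 1.0  # assumed constant gradient, for illustration only
    adagrad_acc = adagrad_accumulate(adagrad_acc, grad)
    rmsprop_acc = rmsprop_accumulate(rmsprop_acc, grad)

print(adagrad_acc)  # keeps growing with the number of steps
print(rmsprop_acc)  # levels off near the squared gradient magnitude
```

With a constant gradient, the AdaGrad accumulator grows with every step while the RMSprop accumulator settles near the squared gradient, so RMSprop's step size does not keep shrinking.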
AdaGrad adapts the learning rate per parameter, dividing it by the square root of the accumulated sum of squared gradients
AdaGrad's learning rate decays toward zero: the sum of squared gradients in the denominator can only grow, so the effective step size keeps shrinking
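The decay can be seen in a short sketch that tracks AdaGrad's effective learning rate over time. The helper name `adagrad_effective_lrs`, the base rate of 0.1, and the constant gradients are assumptions chosen for illustration.

```python
import numpy as np


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    # Running sum of squared gradients, one entry per step.
    acc = np.cumsum(np.asarray(grads, dtype=np.float64) ** 2)
    # Effective learning rate at each step: lr / (sqrt(sum) + eps).
    return lr / (np.sqrt(acc) + eps)


# Assumed toy input: 100 steps with a constant gradient of 1.0.
rates = adagrad_effective_lrs(np.ones(100))
print(rates[0], rates[-1])
```

With constant gradients the accumulated sum grows linearly in the step count t, so the effective rate falls off roughly as 1/sqrt(t).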
In Adam, β₁ controls the exponential decay rate of the first-moment estimate (the running mean of gradients); β₂ controls the exponential decay rate of the second-moment estimate (the running mean of squared gradients)
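A minimal sketch of one Adam update, showing where β₁ and β₂ enter. The function name `adam_step` is hypothetical, and the default values (lr=1e-3, β₁=0.9, β₂=0.999, eps=1e-8) are the commonly cited ones, assumed here rather than taken from the text above.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: decaying average of gradients, rate set by beta1.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: decaying average of squared gradients, rate set by beta2.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Assumed toy state: one parameter at 1.0 with gradient 2.0.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, grad=np.array([2.0]), m=m, v=v, t=1)
print(theta, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.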
Standardize features to zero mean and unit variance when training with gradient-based optimization, which is sensitive to differences in feature scale
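As a sketch of what standardization does, the snippet below rescales each column of a small made-up matrix. The helper name `standardize`, the `eps` guard for zero-variance columns, and the sample data are assumptions for illustration.

```python
import numpy as np


def standardize(X, eps=1e-12):
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against division by zero for constant columns.
    Z = (X - mean) / np.maximum(std, eps)
    return Z, mean, std


# Assumed toy data: two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z, mean, std = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

In practice the mean and standard deviation are usually computed on the training split and reused unchanged for validation and test data.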
Mixed precision training: run the forward pass in FP16 and accumulate gradients in FP32, so small updates are not lost to FP16 rounding
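The snippet below is a NumPy illustration of why accumulation is kept in FP32. It is not a training loop and does not use any framework's mixed precision API. The weight value, the update size of 1e-4, and the step count are assumptions chosen to make the rounding effect visible.

```python
import numpy as np

# Assumed toy setup: a small update computed in FP16, applied 1000 times.
update = np.float16(1e-4)
w_fp16 = np.float16(1.0)  # weight stored and accumulated in FP16
w_fp32 = np.float32(1.0)  # weight stored and accumulated in FP32

for _ in range(1000):
    # Near 1.0, FP16 values are spaced about 0.001 apart, so adding 1e-4
    # rounds back to the same value.
    w_fp16 = w_fp16 + update
    # The same FP16 update, cast up and accumulated in FP32, is retained.
    w_fp32 = w_fp32 + np.float32(update)

print(w_fp16, w_fp32)
```

The FP16 weight never moves, while the FP32 weight picks up the full sum of the updates.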
Second-order methods such as Newton's method converge faster (quadratically near the optimum) but are expensive: O(n³) per iteration to solve the n×n Hessian system
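A minimal sketch of a single Newton step on a small quadratic, where the cost comes from the linear solve against the Hessian. The helper name `newton_step` and the matrix `A` and vector `b` are made up for illustration.

```python
import numpy as np


def newton_step(x, grad_fn, hess_fn):
    g = grad_fn(x)
    H = hess_fn(x)
    # Solving the dense n x n system H d = g is the O(n^3) part.
    return x - np.linalg.solve(H, g)


# Assumed toy objective: f(x) = 0.5 * x^T A x - b^T x, with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

x0 = np.zeros(2)
x1 = newton_step(x0, grad_fn=lambda x: A @ x - b, hess_fn=lambda x: A)
print(x1)
```

For a quadratic objective the Hessian is constant, so one Newton step lands on the minimizer; for general objectives the fast convergence only holds close to the optimum.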