Learning rate warmup gradually increases the learning rate from zero (or a small value) to a predefined target over the first training steps, which stabilizes early training.
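As a rough illustration of warmup, here is a minimal sketch of a linear schedule. The function name `warmup_lr` and the parameters `base_lr` and `warmup_steps` are hypothetical, and linear ramping is only one common choice, not the only form warmup can take.

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=100):
    """Linear warmup: ramp from 0 to base_lr, then hold base_lr.

    All names here are illustrative assumptions, not a library API.
    """
    if warmup_steps <= 0:
        return base_lr
    # Fraction of warmup completed, capped at 1.0 once warmup ends.
    progress = min(1.0, step / warmup_steps)
    return base_lr * progress


# Learning rate at a few points in training.
schedule = [warmup_lr(s) for s in (0, 50, 100, 500)]
```

In a real training loop the returned value would typically be written into the optimizer before each step, and a decay schedule often follows the warmup phase.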
AdaGrad's learning rate decays toward zero
AdaGrad scales the learning rate by accumulated squared gradients. Because that sum in the denominator only ever grows, the effective learning rate shrinks monotonically toward zero, which can stall training late in a run.
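To make the decay concrete, the sketch below tracks the effective learning rate for a single parameter that receives a constant gradient of 1.0. The helper name `adagrad_effective_lrs` and the values of `lr` and `eps` are assumptions chosen for illustration.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate after each gradient is seen.

    Hypothetical helper: the accumulator is the running sum of squared gradients.
    """
    accum = 0.0
    rates = []
    for g in grads:
        accum += g * g  # the sum never decreases
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates


# With a constant gradient of 1.0 the rate falls roughly as lr / sqrt(t).
rates = adagrad_effective_lrs([1.0] * 10_000)
```

Under this constant-gradient assumption the rate drops by a factor of about 100 over 10,000 steps, so the decay is polynomial in the step count rather than exponential.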
Weight initialization matters: Xavier/He init keeps activation variance ≈ 1 across layers
Proper weight initialization stabilizes training by keeping activation variance roughly constant from layer to layer, so signals neither vanish nor explode as depth grows.
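The effect can be checked numerically. The sketch below pushes unit-variance inputs through a stack of ReLU layers and compares He initialization (standard deviation `sqrt(2 / fan_in)`) with a naive standard deviation of 1. The function `final_mean_square`, the layer sizes, and the use of mean squared activation as a stand-in for variance are all assumptions made for this example.

```python
import numpy as np


def final_mean_square(init_std_fn, width=512, depth=5, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers (illustrative helper)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))  # inputs with unit variance
    for _ in range(depth):
        W = rng.normal(scale=init_std_fn(width), size=(width, width))
        a = np.maximum(a @ W, 0.0)  # linear layer followed by ReLU
    return float(np.mean(a ** 2))


# He init: std = sqrt(2 / fan_in), designed for ReLU layers.
he_ms = final_mean_square(lambda fan_in: np.sqrt(2.0 / fan_in))
# Naive init: std = 1 regardless of layer width.
naive_ms = final_mean_square(lambda fan_in: 1.0)
```

With He initialization the mean squared activation should stay on the order of 1, while the naive version grows by a large factor at every layer.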
Adam vs SGD: Adam adapts per-parameter rates, SGD often generalizes better with tuning
Adam adapts the learning rate for each parameter individually, while well-tuned SGD often generalizes better.
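The per-parameter behavior shows up even in a single update. The sketch below compares a plain SGD step with the first Adam step on two parameters whose gradients differ by four orders of magnitude. The function names and hyperparameter values are assumptions, and only the first step (t = 1) is modeled.

```python
import numpy as np


def sgd_step(grad, lr):
    """Plain SGD: the step is proportional to the raw gradient."""
    return -lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step (t = 1) with bias correction; illustrative only."""
    m = (1 - beta1) * grad        # first-moment estimate
    v = (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1)       # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)


grad = np.array([100.0, 0.01])    # very different gradient scales
sgd_delta = sgd_step(grad, lr=0.001)
adam_delta = adam_first_step(grad, lr=0.001)
```

SGD moves the first parameter about 10,000 times further than the second, whereas Adam's first step has magnitude close to `lr` for both. This says nothing about generalization, which depends on the task and on tuning.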
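Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large batch into smaller micro-batches, processes them one at a time, and accumulates their gradients before a single weight update. Peak memory stays at the micro-batch level while the update approximates that of the larger batch.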
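A small NumPy sketch can confirm the equivalence for a simple case. It assumes a linear model with a mean squared error loss and equally sized micro-batches; `mse_grad`, `accum_steps`, and the data shapes are hypothetical choices for the example.

```python
import numpy as np


def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = np.zeros(5)

# Reference: gradient computed on the full batch of 64 examples.
full_grad = mse_grad(w, X, y)

# Accumulation: four micro-batches of 16, each scaled by 1 / accum_steps.
accum_steps = 4
accum_grad = np.zeros_like(w)
for X_mb, y_mb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    accum_grad += mse_grad(w, X_mb, y_mb) / accum_steps

# One weight update after all micro-batches have been processed.
lr = 0.1
w_new = w - lr * accum_grad
```

Dividing each micro-batch gradient by `accum_steps` is what makes the sum match the full-batch mean. Layers whose behavior depends on batch statistics, such as batch normalization, would break this exact equivalence.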
What AdaGrad does: divides the learning rate by the square root of the sum of squared gradients
AdaGrad adjusts learning rate by dividing it by the square root of the sum of squared gradients
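Written out, the rule reads as follows. The symbols are assumptions introduced here: $\theta_t$ for the parameters, $g_t$ for the gradient at step $t$, $\eta$ for the base learning rate, $G_t$ for the per-parameter accumulator, and $\epsilon$ for a small constant that avoids division by zero. Squaring, square root, and division are applied element-wise.

```latex
\begin{aligned}
G_t &= G_{t-1} + g_t \odot g_t, \qquad G_0 = 0 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t
\end{aligned}
```

Some implementations place $\epsilon$ inside the square root instead. The difference is small in practice.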
What the LAMB optimizer does: layer-wise adaptive learning rates for large-batch training
The LAMB optimizer adapts learning rates layer by layer to support large-batch training.
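The layer-wise part can be sketched as a "trust ratio" that rescales each layer's Adam-style update by the ratio of the weight norm to the update norm. The code below is a simplified first-step illustration, not a full implementation. `lamb_first_step`, the layer names, and the hyperparameter values are assumptions, and the optional norm clipping sometimes applied to the weight norm is omitted.

```python
import numpy as np


def lamb_first_step(weights, grads, lr=0.01, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=0.01):
    """One simplified LAMB-style step at t = 1 over a dict of layers."""
    new_weights = {}
    for name, w in weights.items():
        g = grads[name]
        # Adam-style moments with bias correction at t = 1.
        m_hat = ((1 - beta1) * g) / (1 - beta1)
        v_hat = ((1 - beta2) * g ** 2) / (1 - beta2)
        update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
        # Layer-wise trust ratio: weight norm over update norm.
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(update)
        trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        new_weights[name] = w - lr * trust * update
    return new_weights


rng = np.random.default_rng(0)
weights = {
    "embed": rng.normal(scale=1.0, size=(4, 3)),   # large weights
    "head": rng.normal(scale=0.01, size=(3, 2)),   # small weights
}
grads = {name: rng.normal(size=w.shape) for name, w in weights.items()}
updated = lamb_first_step(weights, grads)
```

In this sketch each layer moves by `lr` times its own weight norm, so layers with small weights take proportionally small steps regardless of gradient scale.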