
Adam's bias correction divides each moment estimate by (1-β^t). The moving averages start at zero, so they are biased toward zero in early steps; the correction offsets that bias and fades as t grows
Image: Dwayne Reed (talk), CC BY-SA 3.0, via Wikimedia Commons
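A minimal sketch of the bias-correction idea, assuming a constant gradient of 1.0 and β = 0.9 (both chosen only for illustration; the function name `ema_with_correction` is hypothetical). The raw moving average starts far below the true value, while the corrected one recovers it from the first step.

```python
def ema_with_correction(grads, beta=0.9):
    """Return (raw, corrected) exponential moving averages, one entry per step."""
    m = 0.0  # the estimate starts at zero, which is where the bias comes from
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta ** t))  # divide by (1 - beta^t)
    return raw, corrected


# Illustrative input: the true gradient is always 1.0.
raw, corrected = ema_with_correction([1.0] * 5)
for step, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={step}  raw={r:.4f}  corrected={c:.4f}")
```

With a constant gradient the raw average equals 1 - β^t, so dividing by the same factor returns the true value. With noisy gradients the correction removes the bias only in expectation.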
what the β₁ and β₂ hyperparameters control in Adam
β₁ sets the exponential decay rate of the first-moment estimate (the running mean of gradients); β₂ sets the exponential decay rate of the second-moment estimate (the running mean of squared gradients). Common defaults are β₁ = 0.9 and β₂ = 0.999
AdaGrad's learning rate decays to zero
AdaGrad divides the learning rate by the square root of the accumulated squared gradients. That sum never decreases, so the denominator grows monotonically and the effective learning rate shrinks toward zero over long runs
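The decay can be seen in a short sketch, assuming a single parameter and a constant gradient of 1.0 (an illustrative choice; `adagrad_effective_lrs` is a hypothetical helper name). Under that assumption the accumulated sum after t steps is t, so the effective learning rate falls as lr / sqrt(t).

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective per-step learning rate AdaGrad would apply."""
    accum = 0.0  # running sum of squared gradients; it never decreases
    effective = []
    for g in grads:
        accum += g * g
        effective.append(lr / (math.sqrt(accum) + eps))
    return effective


lrs = adagrad_effective_lrs([1.0] * 100)
print(f"step 1: {lrs[0]:.4f}, step 100: {lrs[-1]:.4f}")
```

RMSprop and Adam replace the running sum with an exponential moving average, which is what keeps their effective learning rate from shrinking indefinitely.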
gradient accumulation simulates larger batch sizes without more memory
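Gradient accumulation splits a large batch into smaller micro-batches, accumulates their gradients over several forward/backward passes, and updates the weights once. Peak memory stays at the micro-batch level while the update matches the larger effective batch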
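A minimal sketch of why the accumulated gradient matches the large-batch gradient, assuming a 1-D linear model y = w * x with mean squared error and equal-size micro-batches (the model, data, and function names are illustrative assumptions, not taken from any framework).

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average micro-batch gradients; assumes len(xs) divides evenly."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch is processed at a time, so peak memory is
        # set by micro_batch_size rather than by the full batch.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accum)
```

Dividing each micro-batch gradient by the number of micro-batches is what makes the sum equal the full-batch mean. Layers that depend on batch statistics, such as batch normalization, would still see only the micro-batch, so the equivalence is not exact for every model.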
Adam combines momentum and RMSprop: adapts per-parameter learning rates
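Adam keeps a momentum-style moving average of gradients and scales each parameter's step by an RMSprop-style moving average of squared gradients, giving every parameter its own adaptive learning rate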
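The combination can be sketched as a single scalar update, assuming the commonly used defaults (lr = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) and a toy objective f(w) = w² chosen only for illustration. `adam_step` is a hypothetical helper, not a library function.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step."""
    m = beta1 * m + (1.0 - beta1) * grad         # momentum-style first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad  # RMSprop-style second moment
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(w)
```

On the first step the corrected moments reduce to the gradient and its square, so the update has magnitude close to lr regardless of the gradient's scale.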
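weight initialization matters: Xavier/He init keeps activation variance roughly constant across layers
Xavier and He initialization scale the weight variance by the layer's fan-in (and, for Xavier, fan-out) so that activation variance stays roughly constant from layer to layer, which helps prevent vanishing or exploding signals in deep networks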
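A rough sketch of the effect, assuming a stack of bias-free, fully connected ReLU layers with Gaussian weights (width, depth, and seed are arbitrary illustrative choices; `mean_square_after_relu_layers` is a hypothetical helper). He initialization uses a weight standard deviation of sqrt(2 / fan_in); the comparison uses a fixed small standard deviation of 0.01.

```python
import math
import random


def mean_square_after_relu_layers(weight_std, width=256, depth=5, seed=0):
    """Mean squared activation after `depth` random ReLU layers of size `width`."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
    for _ in range(depth):
        # Each output unit draws a fresh Gaussian weight row, then applies ReLU.
        x = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
            for _ in range(width)
        ]
    return sum(a * a for a in x) / width


width = 256
he = mean_square_after_relu_layers(math.sqrt(2.0 / width), width=width)
naive = mean_square_after_relu_layers(0.01, width=width)
print(f"He init: {he:.3g}   fixed std 0.01: {naive:.3g}")
```

With He scaling the mean squared activation should stay within the same order of magnitude as the input, while the fixed small standard deviation shrinks it by a large factor at every layer. The exact numbers depend on the seed and width.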
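batch size affects generalization: larger batches tend to find sharper minima
Large-batch training tends to converge to sharper minima, which are often associated with worse generalization. Larger batches give more accurate gradient estimates, but that reduces the gradient noise that helps small-batch training settle in flatter minima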