
Gradient accumulation reduces memory usage by dividing a large batch into smaller mini-batches, accumulating gradients before updating model weights
Image: Masur, CC BY-SA 3.0, via Wikimedia Commons
Gradient accumulation reduces memory usage by dividing a large batch into smaller mini-batches, accumulating gradients before updating model weights
batch size affects generalization: larger batches find sharper minima
Larger batch sizes lead to sharper minima, enhancing generalization by providing more accurate gradient estimates
gradient checkpointing trades: recomputes activations to save memory
Gradient checkpointing trades off computation time for memory savings by recomputing activations
AdaGrad's learning rate decays to zero
AdaGrad adjusts learning rate by accumulating squared gradients, causing it to decay to zero as denominator grows exponentially
LAMB optimizer does: layer-wise adaptive learning rates for large batch training
LAMB optimizer adjusts learning rates layer-wise for large batch training
weight initialization matters: Xavier/He init keeps activation variance ≈ 1 across layers
Weight initialization stabilizes learning by maintaining consistent activation variance
Adam has bias correction: divides by (1-β^t) in early steps
Adam bias correction divides by (1-β^t) in early steps to counteract initial bias from accumulated gradients
One email a day: 5 concepts + the 5 stories that matter →
Swipe through 100 ML concepts daily
Open TickerNews