Larger batch sizes lead to sharper minima, enhancing generalization by providing more accurate gradient estimates
Image: Helmut Schütz, CC BY-SA 2.5, via Wikimedia Commons
Larger batch sizes lead to sharper minima, enhancing generalization by providing more accurate gradient estimates
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation reduces memory usage by dividing a large batch into smaller mini-batches, accumulating gradients before updating model weights
data augmentation does for generalization: artificially expands training set
Data augmentation artificially expands the training set, enhancing model generalization
LAMB optimizer does: layer-wise adaptive learning rates for large batch training
LAMB optimizer adjusts learning rates layer-wise for large batch training
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
AdaGrad's learning rate decays to zero
AdaGrad adjusts learning rate by accumulating squared gradients, causing it to decay to zero as denominator grows exponentially
non-convex loss landscapes are hard: many local minima and saddle points
Non-convex loss landscapes are hard due to many local minima and saddle points
One email a day: 5 concepts + the 5 stories that matter →
Swipe through 100 ML concepts daily
Open TickerNews