Gradient clipping caps gradient norm to prevent exploding gradients
Image: Ulli Purwin, CC BY 3.0, via Wikimedia Commons
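As an illustration of capping the gradient norm, here is a minimal pure-Python sketch. The function name `clip_by_global_norm` and the `max_norm` parameter are hypothetical names chosen for this example, and the gradients are treated as a flat list of floats rather than framework tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Hypothetical helper for illustration; returns (clipped_grads, norm_before).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the cap: leave the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    # Scaling every component by the same factor keeps the direction intact.
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Only the magnitude changes: the clipped vector points in the same direction as the original, which is why norm clipping is usually preferred over clamping each component separately.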
When to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize when zero mean and unit variance are required for gradient-based optimization
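A small sketch of standardization on a single feature, assuming the population standard deviation is used and that a constant feature maps to zeros. The function name `standardize` and the sample values are made up for this example.

```python
import statistics

def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no spread; map it to zeros (an assumption).
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

feature = [150.0, 160.0, 170.0, 180.0, 190.0]  # illustrative raw values
z = standardize(feature)
print(z)
```

In practice the mean and standard deviation would be computed on the training split only and reused for validation and test data.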
Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation keeps memory usage low by splitting a large batch into smaller micro-batches, summing their gradients, and updating the model weights once after the last micro-batch
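The sketch below checks the idea on a toy problem: for a 1-D model `y_hat = w * x` with mean squared error, the averaged gradients of four micro-batches match the gradient of the full batch. The model, data, learning rate, and the helper `grad_mse` are all assumptions made for this example.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5
micro_batch_size = 2
num_micro = len(xs) // micro_batch_size

# Reference: gradient computed on the whole batch at once.
full_grad = grad_mse(w, xs, ys)

# Accumulation: only one micro-batch is processed at a time.
accumulated = 0.0
for i in range(0, len(xs), micro_batch_size):
    g = grad_mse(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    accumulated += g / num_micro  # scale so the sum equals the full-batch mean

# A single weight update after all micro-batches have been seen.
lr = 0.01
w_new = w - lr * accumulated
print(full_grad, accumulated, w_new)
```

The division by `num_micro` matters: without it the accumulated gradient would be `num_micro` times larger than the full-batch gradient, which silently changes the effective learning rate.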
Vanishing gradient problem
Residual connections help by allowing gradient flow through the skip connection
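To see why the skip connection helps, consider a toy stack of scalar tanh layers. Without a skip, each layer multiplies the gradient by a factor below one; with a skip, each factor becomes one plus that term. This is a deliberately simplified sketch: the scalar layers, the shared `weight`, and the function name `input_gradient` are assumptions, and real networks use matrices and normalization.

```python
import math

def input_gradient(x, weight, depth, residual):
    """Return d(output)/d(input) for a stack of scalar tanh layers via the chain rule."""
    h, grad = x, 1.0
    for _ in range(depth):
        # Local derivative of tanh(weight * h) with respect to h.
        local = weight * (1.0 - math.tanh(weight * h) ** 2)
        if residual:
            h = h + math.tanh(weight * h)
            grad *= 1.0 + local  # the skip path contributes the identity term
        else:
            h = math.tanh(weight * h)
            grad *= local
    return grad

plain = input_gradient(x=1.0, weight=0.5, depth=30, residual=False)
skip = input_gradient(x=1.0, weight=0.5, depth=30, residual=True)
print(plain, skip)
```

With these settings the plain stack's gradient shrinks below 1e-8 after 30 layers, while the residual stack's gradient never drops below 1 because every per-layer factor is at least the identity.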
What the momentum term does: v_t = βv_{t-1} + ∇L, accumulating past gradient directions
The momentum term accelerates convergence along directions where successive gradients agree and damps oscillation where they alternate in sign
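A short sketch of the update v_t = βv_{t-1} + ∇L. The weight step `w = w - lr * v`, the values of `lr` and `beta`, and the two synthetic gradient sequences are assumptions added for illustration.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One momentum update: v_t = beta * v_{t-1} + grad, then w -= lr * v_t."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Case 1: the gradient keeps pointing the same way, so velocity builds up.
w_const, v_const = 0.0, 0.0
for _ in range(100):
    w_const, v_const = momentum_step(w_const, v_const, grad=1.0)

# Case 2: the gradient flips sign every step, so contributions mostly cancel.
w_alt, v_alt = 0.0, 0.0
for t in range(100):
    w_alt, v_alt = momentum_step(w_alt, v_alt, grad=1.0 if t % 2 == 0 else -1.0)

print(v_const, v_alt)
```

With a constant unit gradient the velocity approaches 1 / (1 - beta), which is 10 for beta = 0.9, so the effective step is about ten times larger. With an alternating gradient the velocity stays below 1 in magnitude.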
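What mixed precision training does: forward in FP16, accumulate gradients in FP32
Mixed precision training runs the forward and backward passes in FP16 and applies the gradient updates to an FP32 master copy of the weights, typically with loss scaling so that small gradients do not underflow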
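The sketch below suggests why accumulation is kept in FP32. It simulates half and single precision by round-tripping Python floats through the standard `struct` module (format codes `e` and `f`); the helpers `to_fp16` and `to_fp32`, the starting value, and the update size are assumptions for this example, not part of any training framework.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

update = 1e-4           # a small per-step contribution
acc16 = to_fp16(1.0)    # accumulator stored in FP16
acc32 = to_fp32(1.0)    # accumulator stored in FP32

for _ in range(1000):
    # Each sum is rounded back to the accumulator's storage precision.
    acc16 = to_fp16(acc16 + to_fp16(update))
    acc32 = to_fp32(acc32 + to_fp32(update))

print(acc16, acc32)  # acc16 stays at 1.0; acc32 ends near 1.1
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so adding 0.0001 rounds straight back to 1.0 and the FP16 accumulator never moves. The FP32 accumulator has enough precision to register every update.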
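Batch size affects generalization: larger batches tend to find sharper minima
Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which are associated with worse generalization; the gradient noise of smaller batches acts as an implicit regularizer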
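The link between batch size and the accuracy of the gradient estimate can be sketched with a simulation. It assumes each per-example gradient is a true gradient of 1.0 plus unit Gaussian noise, which is a made-up noise model; the function name `minibatch_grad_std` and its parameters are hypothetical. The sketch measures only the noise in the estimate and says nothing by itself about the sharpness of minima or about generalization.

```python
import random
import statistics

def minibatch_grad_std(batch_size, trials=2000, seed=0):
    """Std. dev. of the mini-batch mean gradient under a toy noise model.

    Each per-example gradient is true_grad plus unit Gaussian noise (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_grad = 1.0
    estimates = []
    for _ in range(trials):
        batch = [true_grad + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(statistics.fmean(batch))
    return statistics.pstdev(estimates)

std_small = minibatch_grad_std(batch_size=4)
std_large = minibatch_grad_std(batch_size=64)
print(std_small, std_large)
```

Under this noise model the spread of the estimate shrinks roughly as one over the square root of the batch size, so a batch of 64 gives an estimate about four times less noisy than a batch of 4.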