gradient clipping does: caps gradient norm to prevent exploding gradients

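Gradient clipping caps the gradient norm at a chosen threshold: when the norm exceeds the threshold, the gradient is rescaled down to it, which prevents exploding gradients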
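A minimal sketch of norm-based clipping in plain Python, assuming the gradient is a flat list of floats. The function name `clip_by_norm` and the parameter `max_norm` are illustrative, not taken from any particular library.

```python
import math

def clip_by_norm(grads, max_norm):
    """Return grads rescaled so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    # Leave the gradient untouched when it is already within the cap.
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    # Rescaling keeps the direction and only shrinks the length.
    return [g * scale for g in grads]

# A gradient with norm 5 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Rescaling the whole vector, rather than clamping each component, preserves the update direction.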

Related concepts

to standardize: when you need zero mean and unit variance for gradient-based optimization

Standardize features when zero mean and unit variance are required, as is often the case for gradient-based optimization
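Standardization can be sketched as subtracting the mean and dividing by the standard deviation. The helper below is an illustrative assumption (name and zero-variance handling are not from the source) and uses the population standard deviation.

```python
import statistics

def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Assumption: a constant feature carries no signal, so map it to zeros.
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
```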

gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation splits a large batch into smaller micro-batches and accumulates their gradients before updating the model weights once, so peak memory is that of a single micro-batch while the update matches the large batch
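A small sketch of why accumulation reproduces the large-batch gradient, assuming a one-parameter model `y ≈ w * x` with mean squared error. The function names and the toy data are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Accumulate micro-batch gradients, weighted by micro-batch size."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Weighting by len(micro) / len(data) makes the sum equal the
        # full-batch mean even when the last micro-batch is smaller.
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
```

Only one micro-batch needs to be held in memory at a time, and the weights are updated once per accumulated gradient.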

Vanishing gradient problem

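Residual connections help because the skip connection gives gradients an identity path around each layer, so they can flow backward without being shrunk by every layer's local derivative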
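A toy scalar illustration, assuming each layer has a small local derivative. For a plain layer `y = f(x)` the backward factor is `f'(x)`; for a residual layer `y = x + f(x)` it is `1 + f'(x)`. The depth and derivative values are made up for the example.

```python
import math

def plain_chain_gradient(local_derivs):
    """Backward factor through stacked plain layers: the product of f'(x)."""
    return math.prod(local_derivs)

def residual_chain_gradient(local_derivs):
    """Backward factor through stacked residual layers: the product of 1 + f'(x)."""
    return math.prod(1.0 + d for d in local_derivs)

# Hypothetical stack of 20 layers, each with local derivative 0.1.
local_derivs = [0.1] * 20
plain = plain_chain_gradient(local_derivs)        # shrinks toward zero
residual = residual_chain_gradient(local_derivs)  # identity term keeps it away from zero
```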

the momentum term does: v_t = βv_{t-1} + ∇L, accumulates gradient direction

The momentum term accelerates convergence along directions where successive gradients agree and damps oscillation where they alternate in sign
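The velocity update v_t = βv_{t-1} + ∇L can be sketched as follows. The weight update `w = w - lr * v`, the learning rate, and the quadratic test loss are assumptions added for illustration.

```python
def sgd_momentum(grad_fn, w, lr, beta, steps):
    """Run gradient descent with momentum on a scalar parameter."""
    v = 0.0
    for _ in range(steps):
        # v_t = beta * v_{t-1} + grad L(w): a decaying sum of past gradients.
        v = beta * v + grad_fn(w)
        # Assumed update rule: step against the accumulated velocity.
        w = w - lr * v
    return w

def grad_quadratic(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_final = sgd_momentum(grad_quadratic, w=0.0, lr=0.1, beta=0.9, steps=200)
```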

mixed precision training does: forward in FP16, accumulate gradients in FP32

Mixed precision training runs the forward pass in FP16 and accumulates gradients in FP32, so small contributions that FP16 would round away are preserved
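A toy emulation of why accumulation is kept in FP32, using the standard library's `struct` formats `'e'` (IEEE half precision) and `'f'` (single precision) to round values. This is not a training loop; the starting value, the update size, and the helper names are assumptions.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(start, update, steps, round_fn):
    """Add `update` to `start` repeatedly, rounding the sum after each step."""
    acc = round_fn(start)
    for _ in range(steps):
        acc = round_fn(acc + update)
    return acc

small_update = to_fp16(1e-4)  # a small gradient value produced in FP16
fp16_total = accumulate(1.0, small_update, 1000, to_fp16)
fp32_total = accumulate(1.0, small_update, 1000, to_fp32)
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so each update of about 0.0001 is rounded away in the FP16 accumulator, while the FP32 accumulator retains it.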

batch size affects generalization: larger batches find sharper minima

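Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which are associated with worse generalization; the noisier gradients of smaller batches tend to favor flatter minima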
