Mixed precision training: forward in FP16, accumulate gradients in FP32
Image: Enrique Íñiguez Rodríguez (Qoan), CC BY-SA 4.0, via Wikimedia Commons
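To illustrate why gradients are accumulated in FP32 rather than FP16, here is a minimal sketch that simulates both formats with Python's `struct` module. The helper names (`to_fp16`, `to_fp32`, `accumulate`) and the gradient values are assumptions made up for this illustration, not part of any training framework.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(grads, cast):
    """Sum gradients, rounding the running total to the given format each step."""
    total = cast(0.0)
    for g in grads:
        total = cast(total + cast(g))
    return total

# Hypothetical data: 10,000 small gradient contributions whose true sum is 1.0.
grads = [1e-4] * 10_000

fp16_total = accumulate(grads, to_fp16)  # stalls once each addend is below half a rounding step
fp32_total = accumulate(grads, to_fp32)  # stays close to 1.0

print(fp16_total, fp32_total)
```

In the FP16 run the total stops growing well short of 1.0, because each new contribution is smaller than half the spacing between representable values near the running sum. The FP32 accumulator does not hit that limit at this scale.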
When to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize when zero mean and unit variance are required for gradient-based optimization
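A minimal sketch of standardization (z-scoring) using only the standard library. The function name `standardize` and the sample values are assumptions chosen for illustration; the population standard deviation is used here, and a sample-based estimate would work equally well.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Shift to zero mean and scale to unit variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical feature column.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)

print(z)
```

In practice the mean and standard deviation are typically computed on the training split only and then reused for validation and test data.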
Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation keeps peak memory at the level of one small micro-batch: a large batch is split into micro-batches, their gradients are summed over several forward and backward passes, and the model weights are updated once at the end
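The idea can be checked on a toy problem. The sketch below assumes a one-parameter model `y_hat = w * x` with mean squared error, and made-up data; it shows that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient, so a single update after accumulation matches the large-batch update.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical data and hyperparameters.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w, lr, micro_batch_size = 0.5, 0.01, 2

# Reference: gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

# Accumulation: process one micro-batch at a time, update only at the end.
accum_steps = len(xs) // micro_batch_size
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    g = mse_grad(w, xs[start:start + micro_batch_size],
                 ys[start:start + micro_batch_size])
    accumulated += g / accum_steps  # scale so the sum is a mean over the full batch

w_new = w - lr * accumulated  # single weight update
print(full_grad, accumulated, w_new)
```

The equivalence relies on the micro-batches being the same size and on the loss being a mean over examples. Layers that depend on batch statistics, such as batch normalization, still see only the micro-batch.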
When to use F1 score: when classes are imbalanced and both FP and FN matter
Use F1 score when classes are imbalanced and both false positives (FP) and false negatives (FN) matter
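A short sketch of why accuracy can mislead on imbalanced data while F1 does not. The confusion-matrix counts are hypothetical, and `f1_score` here is a hand-written helper, not a library function.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced dataset: 1000 examples, only 10 positives.
# A classifier that predicts "negative" for everything:
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)  # looks excellent
f1 = f1_score(tp, fp, fn)                   # exposes that no positive is found

# A classifier that finds 8 of the 10 positives with 2 false alarms:
f1_better = f1_score(tp=8, fp=2, fn=2)

print(accuracy, f1, f1_better)
```

Because F1 ignores true negatives, the large negative class cannot inflate the score.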
What gradient clipping does: caps gradient norm to prevent exploding gradients
Gradient clipping caps gradient norm to prevent exploding gradients
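A minimal sketch of clipping by L2 norm, assuming the gradient is a flat list of floats. The function name `clip_by_norm` and the threshold are illustrative choices; deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# Hypothetical exploding gradient with norm 50, clipped to norm 5.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
# A small gradient passes through untouched.
untouched = clip_by_norm([0.3, 0.4], max_norm=5.0)

print(clipped, untouched)
```

Rescaling the whole vector preserves the gradient's direction and limits only the step size.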
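What mixup does: trains on convex combinations of pairs: x̃ = λx_i + (1-λ)x_j
Mixup trains on convex combinations of pairs of examples, x̃ = λx_i + (1-λ)x_j, and mixes their labels with the same λ, ỹ = λy_i + (1-λ)y_j; the model sees x̃ in place of the original x_i and x_j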
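A minimal sketch of the mixing step on plain Python lists. Two details go beyond the formula above and follow the usual mixup formulation: the one-hot labels are mixed with the same λ as the inputs, and λ is drawn from a Beta(α, α) distribution. The function name, the `alpha` default, and the sample vectors are assumptions for illustration.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Return a convex combination of two examples and of their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

# Hypothetical pair of examples with one-hot labels for two classes.
rng = random.Random(0)  # seeded only to make the sketch repeatable
x_mix, y_mix, lam = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                          [3.0, 4.0, 0.0], [0.0, 1.0], rng=rng)

print(lam, x_mix, y_mix)
```

With a small `alpha`, most draws of λ land near 0 or 1, so the mixed example usually stays close to one of the two originals.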
Why AdaGrad's learning rate decays to zero
AdaGrad divides the learning rate by the square root of the accumulated squared gradients; because that sum never decreases, the effective learning rate shrinks toward zero as training goes on
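The decay can be seen numerically. The sketch below tracks the effective per-step learning rate of AdaGrad for a single parameter, assuming a constant gradient of 1.0; the function name, base learning rate, and `eps` are illustrative values.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Effective learning rate lr / (sqrt(sum of squared grads) + eps) at each step."""
    accum = 0.0
    rates = []
    for g in grads:
        accum += g * g  # the accumulator never decreases
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates

# Hypothetical run: 100 steps with a constant gradient of 1.0.
rates = adagrad_effective_lrs([1.0] * 100)

print(rates[0], rates[9], rates[99])
```

With a constant gradient the accumulator grows linearly in the step count t, so the effective rate falls off like 1/sqrt(t). This slow but unbounded decay is what methods such as RMSProp address by replacing the running sum with a moving average.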