mixed precision training does: forward in FP16, accumulate gradients in FP32

Mixed precision training runs the forward pass in FP16 and accumulates gradients in FP32
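A small sketch of why the accumulation side is kept in FP32. It uses the standard-library struct module to round Python floats to half ("e") and single ("f") precision. The helper names, the starting value 1.0, and the update size 1e-4 are illustrative assumptions, not taken from any framework.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(start: float, update: float, steps: int, cast) -> float:
    """Add `update` to `start` repeatedly, rounding to the target precision each step."""
    total = cast(start)
    for _ in range(steps):
        total = cast(total + cast(update))
    return total

# Near 1.0 the gap between adjacent FP16 values is about 0.001,
# so an update of 0.0001 is rounded away on every single step.
fp16_total = accumulate(1.0, 1e-4, 1000, to_fp16)
fp32_total = accumulate(1.0, 1e-4, 1000, to_fp32)

print(fp16_total)  # the FP16 accumulator never moves
print(fp32_total)  # the FP32 accumulator ends close to 1.1
```

The FP16 accumulator silently drops every small update, while the FP32 one keeps them, which is the motivation for doing the accumulation in the wider type.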

Related concepts

to standardize: when you need zero mean and unit variance for gradient-based optimization

Standardize when zero mean and unit variance are required for gradient-based optimization
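A minimal sketch of standardization using only the standard library. The function name and the sample data are hypothetical, and the population standard deviation is an assumed choice (the sample standard deviation is also common).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # illustrative data
z_scores = standardize(heights_cm)

print(fmean(z_scores))   # approximately 0
print(pstdev(z_scores))  # approximately 1
```

In practice the mean and standard deviation would be computed on the training split only and reused for validation and test data.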

gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation simulates a larger batch size without extra memory: it splits a large batch into smaller micro-batches, accumulates their gradients, and updates the model weights once
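The equivalence can be checked on a toy problem. This sketch assumes a one-parameter linear model y_hat = w * x with mean squared error and equal-sized micro-batches; the function name, data, and learning rate are hypothetical.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for y_hat = w * x, with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Reference: gradient computed on the whole batch at once.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: process two examples at a time and average the gradients.
micro_batch_size = 2
num_micro_batches = len(xs) // micro_batch_size
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    stop = start + micro_batch_size
    micro_grad = mse_grad(w, xs[start:stop], ys[start:stop])
    accumulated_grad += micro_grad / num_micro_batches

# A single weight update after all micro-batches have been seen.
learning_rate = 0.01
w_updated = w - learning_rate * accumulated_grad

print(full_batch_grad, accumulated_grad)
```

Only one micro-batch needs to be held in memory at a time, yet the update matches the full-batch gradient because each micro-batch gradient is scaled by the number of micro-batches.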

to use F1 score: when classes are imbalanced and both FP and FN matter

Use the F1 score when classes are imbalanced and both false positives (FP) and false negatives (FN) matter
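A short sketch of why accuracy can mislead on imbalanced data while F1 does not. The dataset (5 positives out of 100) and both sets of predictions are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 4 of 5 positives but raises 4 false alarms.
imperfect = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))
print(accuracy(y_true, imperfect), f1_score(y_true, imperfect))
```

The always-negative classifier reaches high accuracy with an F1 of zero, while the second classifier has slightly lower accuracy and a much higher F1 because it is penalized for both FP and FN but still finds most positives.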

gradient clipping does: caps gradient norm to prevent exploding gradients

Gradient clipping caps gradient norm to prevent exploding gradients
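A minimal sketch of clipping by global L2 norm on a flat list of gradient values. The function name and the threshold max_norm are hypothetical; deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

large = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
small = clip_by_norm([0.3, 0.4], max_norm=1.0)  # original norm is 0.5

print(large)
print(small)
```

Because every component is scaled by the same factor, the update direction is unchanged and only its length is capped.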

mixup does: trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j

Mixup trains on convex combinations of pairs of examples: x̃=λx_i+(1-λ)x_j
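A minimal sketch of the mixing step on plain Python lists. Two things here go beyond the formula above and are assumptions: the labels are mixed with the same λ, and λ is drawn from a Beta(α, α) distribution, as in the usual mixup formulation. The function name and the value α = 0.4 are hypothetical.

```python
import random

def mixup(x_i, x_j, y_i, y_j, lam):
    """Return the convex combination of two examples and of their one-hot labels."""
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix

# Fixed lambda so the result is easy to check by hand.
x_mix, y_mix = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
print(x_mix, y_mix)

# During training, lambda would be sampled per batch or per pair.
alpha = 0.4  # assumed hyperparameter
sampled_lam = random.betavariate(alpha, alpha)
x_rand, y_rand = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], sampled_lam)
```

With λ = 0.25 the mixed example sits three quarters of the way toward x_j, and the mixed label carries the same weights.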

AdaGrad's learning rate decays to zero

AdaGrad divides the learning rate by the square root of the accumulated squared gradients; because that sum only grows, the effective learning rate decays toward zero
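A small sketch of the decay for a single parameter. It assumes a constant gradient of 1.0 so the arithmetic is easy to follow: after t steps the accumulated sum is t, and the effective rate is roughly lr / sqrt(t). The function name, base rate, and eps are hypothetical.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective per-step learning rate AdaGrad would use for one parameter."""
    accumulated_sq = 0.0
    effective = []
    for g in grads:
        accumulated_sq += g * g  # the sum never shrinks
        effective.append(lr / (math.sqrt(accumulated_sq) + eps))
    return effective

effective_lrs = adagrad_effective_lrs([1.0] * 100)

print(effective_lrs[0])   # close to 0.1 at step 1
print(effective_lrs[99])  # close to 0.01 at step 100
```

The rate falls monotonically and never recovers, which is why later optimizers such as RMSProp replace the running sum with an exponentially decaying average.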
