Weight initialization stabilizes learning by maintaining consistent activation variance
Image: Billy69150 (see licensing below), CC BY-SA 4.0, via Wikimedia Commons
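A minimal sketch of the variance claim, assuming plain linear layers with no nonlinearity and Gaussian weights. The function name `forward_variance` and the width, depth, and seed values are illustrative choices, not part of the original card. It compares weights drawn with standard deviation 1/sqrt(fan_in) against weights drawn with standard deviation 1.

```python
import math
import random

def forward_variance(weight_std, width=64, depth=8, seed=0):
    """Push one random input through `depth` linear layers.

    Returns the variance of the activations after each layer.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = []
    for _ in range(depth):
        # Each output unit is a weighted sum of all `width` inputs.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
        mean = sum(x) / width
        variances.append(sum((v - mean) ** 2 for v in x) / width)
    return variances

width = 64
# Weight variance 1/fan_in keeps the activation variance roughly constant.
scaled = forward_variance(weight_std=1.0 / math.sqrt(width), width=width)
# Unit-variance weights multiply the variance by about `width` per layer.
naive = forward_variance(weight_std=1.0, width=width)

print("scaled init, last layer variance:", scaled[-1])
print("naive init, last layer variance:", naive[-1])
```

With the scaled weights the variance should stay within an order of magnitude or so of 1, while the unscaled weights make it grow by roughly a factor of `width` at every layer.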
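What AWQ does differently
AWQ (Activation-aware Weight Quantization) uses activation statistics to find the small fraction of weight channels that matter most for model performance and protects them during quantization, whereas traditional round-to-nearest quantization treats all weights equally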
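The idea can be illustrated with a toy sketch. This is not the AWQ implementation: it assumes a single "salient" input channel with large activations, a fixed illustrative scale of 2 for that channel, and simple per-row round-to-nearest quantization. All names (`quantize_row`, `output_error`, `protect_salient`) are hypothetical. Scaling a weight up before quantization and dividing its activation by the same factor leaves the exact product unchanged but shrinks the rounding error that channel contributes.

```python
import random

def quantize_row(weights, n_levels=7):
    """Symmetric round-to-nearest quantization with one step size per row.

    Integers lie in [-n_levels, n_levels], similar to a signed 4-bit grid.
    """
    step = max(abs(w) for w in weights) / n_levels
    return [round(w / step) * step for w in weights]

def output_error(weights, activations, scales):
    """Absolute error of the quantized dot product when channel i is
    multiplied by scales[i] before quantization."""
    exact = sum(w * a for w, a in zip(weights, activations))
    scaled_q = quantize_row([w * s for w, s in zip(weights, scales)])
    # Dividing the activations by the same scales keeps the exact product unchanged.
    approx = sum(q * (a / s) for q, a, s in zip(scaled_q, activations, scales))
    return abs(exact - approx)

rng = random.Random(0)
n_channels, n_rows = 64, 200
activations = [10.0] + [0.1] * (n_channels - 1)      # channel 0 is the salient one
no_scaling = [1.0] * n_channels
protect_salient = [2.0] + [1.0] * (n_channels - 1)   # illustrative scale, not a tuned value

rows = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(n_rows)]
baseline_error = sum(output_error(r, activations, no_scaling) for r in rows) / n_rows
awq_style_error = sum(output_error(r, activations, protect_salient) for r in rows) / n_rows

print("mean error, plain rounding:", baseline_error)
print("mean error, salient channel scaled:", awq_style_error)
```

In practice the per-channel scales would be chosen from calibration data rather than fixed by hand; the sketch only shows why protecting a high-activation channel lowers the output error.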
Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation reduces peak memory usage by splitting a large batch into smaller micro-batches, summing their gradients over several forward and backward passes, and updating the model weights only once per accumulated batch
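A small sketch of why this works, assuming a one-parameter model `y_hat = w * x` with mean squared error and a batch that divides evenly into micro-batches. The function names and data are made up for illustration. The accumulated gradient matches the full-batch gradient, while only one micro-batch is processed at a time.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Process the batch in micro-batches, summing scaled gradients
    before any weight update happens."""
    n_micro = len(xs) // micro_batch_size  # assumes the batch divides evenly
    grad = 0.0
    for i in range(n_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return grad

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full = mse_gradient(w, xs, ys)
accum = accumulated_gradient(w, xs, ys, micro_batch_size=2)
w_next = w - 0.01 * accum  # a single update per accumulated batch

print(full, accum, w_next)
```

The division by `n_micro` is the detail that is easiest to forget: without it the accumulated gradient is `n_micro` times too large.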
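Gradient checkpointing trades compute for memory: recomputes activations instead of storing them
Gradient checkpointing trades extra computation for memory savings: it stores only a subset of activations during the forward pass and recomputes the rest when they are needed in the backward pass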
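The trade-off can be sketched on a chain of scalar layers, assuming each layer is `tanh(a * x)` and checkpoints are placed every `segment` layers. The helper names and the segment length are illustrative assumptions. Both backward passes return the same gradient, but the checkpointed one keeps fewer values alive at once and pays for it with a second forward pass through each segment.

```python
import math

def layer(a, x):
    return math.tanh(a * x)

def layer_grad(a, x):
    """Derivative of tanh(a * x) with respect to x; it needs the layer input x."""
    return a * (1.0 - math.tanh(a * x) ** 2)

def backward_full(params, x0):
    """Baseline: store every layer input during the forward pass."""
    inputs, x = [], x0
    for a in params:
        inputs.append(x)
        x = layer(a, x)
    grad = 1.0
    for a, x_in in zip(reversed(params), reversed(inputs)):
        grad *= layer_grad(a, x_in)
    return grad, len(inputs)

def backward_checkpointed(params, x0, segment=4):
    """Store only the input of each segment; recompute the rest during backward."""
    checkpoints, x = [], x0
    for i, a in enumerate(params):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(a, x)
    grad = 1.0
    peak_stored = len(checkpoints)
    for seg in reversed(range(len(checkpoints))):
        seg_params = params[seg * segment:(seg + 1) * segment]
        # Recompute this segment's inputs from its checkpoint (the extra compute).
        inputs, x = [], checkpoints[seg]
        for a in seg_params:
            inputs.append(x)
            x = layer(a, x)
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for a, x_in in zip(reversed(seg_params), reversed(inputs)):
            grad *= layer_grad(a, x_in)
    return grad, peak_stored

params = [0.9 + 0.01 * i for i in range(16)]
grad_full, stored_full = backward_full(params, x0=0.5)
grad_ckpt, stored_ckpt = backward_checkpointed(params, x0=0.5)

print(grad_full, grad_ckpt, stored_full, stored_ckpt)
```

Deep learning frameworks provide this as a built-in utility, so a hand-written version like this one is mainly useful for seeing where the recomputation happens.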
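Adam has bias correction: divides moment estimates by (1-β^t), which matters most in early steps
Adam's bias correction divides the moment estimates by (1-β^t) to counteract their bias toward zero, which arises because the running averages are initialized at zero; the correction is largest in early steps and fades as t grows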
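A short sketch of the effect, assuming a constant gradient of 1 so the "true" average is known. The function name `ema_with_correction` is made up for illustration. The raw exponential moving average starts far below 1 and climbs slowly, while the corrected value is right from the first step.

```python
def ema_with_correction(grads, beta):
    """Return (raw, corrected) exponential moving averages after each step."""
    m = 0.0  # zero initialization is what causes the bias toward zero
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta ** t))  # the bias correction
    return raw, corrected

raw, corrected = ema_with_correction([1.0] * 50, beta=0.9)

for t in (1, 2, 10, 50):
    print(t, raw[t - 1], corrected[t - 1])
```

For a constant gradient the raw average after t steps equals 1 - β^t times the gradient, which is exactly the factor the correction divides out.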
Adam combines momentum and RMSprop: adapts per-parameter learning rates
Adam combines momentum (a running average of gradients) with RMSprop (a running average of squared gradients), so each parameter gets its own effective learning rate
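A minimal sketch of one Adam update over a list of scalar parameters, using the commonly cited default hyperparameters as assumptions. The function name `adam_step` and the toy objective are illustrative. The example uses two parameters whose gradients differ by four orders of magnitude to show the per-parameter scaling.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of scalars; returns updated (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # momentum: running mean of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # RMSprop: running mean of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** t)             # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy objective f(p) = 100 * p0^2 + 0.01 * p1^2, evaluated at p = [1, 1].
params = [1.0, 1.0]
grads = [200.0 * params[0], 0.02 * params[1]]
new_params, m, v = adam_step(params, grads, [0.0, 0.0], [0.0, 0.0], t=1, lr=0.01)
step_sizes = [old - new for old, new in zip(params, new_params)]

print(step_sizes)
```

Both parameters move by about `lr` on the first step even though their gradients are 200 and 0.02, because each gradient is divided by the square root of its own squared-gradient average.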
Ill-conditioned matrices cause numerical instability: small input changes → large output changes
Ill-conditioned matrices (those with a large condition number) amplify small perturbations in the input, such as rounding errors, into large changes in the solution of a linear system
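A small worked example, assuming a nearly singular 2x2 system solved with Cramer's rule. The matrix values and the helper `solve_2x2` are chosen for illustration. Changing one entry of the right-hand side by 0.0001 moves the solution from about [1, 1] to about [0, 2], while the same change barely affects a well-conditioned system.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

ill = [[1.0, 1.0], [1.0, 1.0001]]   # rows are almost parallel
well = [[1.0, 0.0], [0.0, 1.0]]     # identity matrix, perfectly conditioned

b = [2.0, 2.0001]
b_perturbed = [2.0, 2.0002]         # one entry changed by 0.0001

x_ill = solve_2x2(ill, b)
x_ill_perturbed = solve_2x2(ill, b_perturbed)
x_well = solve_2x2(well, b)
x_well_perturbed = solve_2x2(well, b_perturbed)

print("ill-conditioned:", x_ill, "->", x_ill_perturbed)
print("well-conditioned:", x_well, "->", x_well_perturbed)
```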