Gradient checkpointing trades extra computation time for memory savings by recomputing activations during the backward pass instead of storing all of them
Image: Aditya Suseno, CC0, via Wikimedia Commons
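A minimal sketch of the recomputation idea, assuming a toy chain of scalar layers of the form tanh(w * a). The function names (`forward_full`, `grad_full`, `grad_checkpointed`) and the `segment` parameter are hypothetical, chosen only for illustration. Real frameworks apply the same idea to tensors.

```python
import math

def forward_full(x, weights):
    """Run the chain and keep every activation (memory grows with depth)."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def grad_full(x, weights):
    """d(output)/d(input) using all stored activations."""
    acts = forward_full(x, weights)
    grad = 1.0
    for w, out in zip(reversed(weights), reversed(acts[1:])):
        grad *= w * (1.0 - out * out)  # derivative of tanh(w * a) w.r.t. a
    return acts[-1], grad

def grad_checkpointed(x, weights, segment=2):
    """Same gradient, but the forward pass keeps only each segment's input."""
    checkpoints = []
    a = x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(a)  # store the segment input, drop the rest
        a = math.tanh(w * a)
    output = a

    grad = 1.0
    # Walk the segments from last to first, recomputing activations inside each.
    for s in reversed(range(len(checkpoints))):
        seg_w = weights[s * segment:(s + 1) * segment]
        seg_acts = forward_full(checkpoints[s], seg_w)  # the recomputation step
        for w, out in zip(reversed(seg_w), reversed(seg_acts[1:])):
            grad *= w * (1.0 - out * out)
    return output, grad, len(checkpoints)
```

In this sketch the checkpointed version stores one value per segment rather than one per layer, and pays for it with one extra forward pass through each segment.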
Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation reduces memory usage by splitting a large batch into smaller mini-batches and accumulating their gradients before a single update of the model weights
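As a rough illustration, the sketch below checks that accumulating mini-batch gradients reproduces the full-batch gradient. It assumes a one-parameter model y_hat = w * x with mean squared error. The helper names `mse_grad` and `accumulated_grad` are made up for this example.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, mini_size):
    """Process the batch in small pieces and accumulate their gradients."""
    total = 0.0
    for start in range(0, len(batch), mini_size):
        mini = batch[start:start + mini_size]
        # Weight each piece by its share of the full batch so that the
        # accumulated value equals the full-batch mean gradient.
        total += mse_grad(w, mini) * len(mini) / len(batch)
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)
accumulated = accumulated_grad(0.5, data, mini_size=2)
```

Only one mini-batch needs to be held in memory at a time, which is where the saving comes from. The weight update happens once, after the accumulation loop.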
Vanishing gradient problem
Residual connections help by allowing gradient flow through the skip connection
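The effect can be sketched with a toy scalar network, assuming each layer is tanh(w * a) and the residual variant computes a + tanh(w * a). The function name `input_gradient` and the chosen weight and depth are illustrative assumptions, not taken from any particular model.

```python
import math

def input_gradient(x, weight, depth, residual):
    """Return d(output)/d(input) for a chain of identical scalar layers."""
    a, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(weight * a)
        local = weight * (1.0 - t * t)  # derivative of tanh(weight * a)
        if residual:
            a = a + t            # skip connection: output = input + layer(input)
            grad *= 1.0 + local  # the "1" is the gradient path through the skip
        else:
            a = t
            grad *= local        # every factor is below 1 here, so the product shrinks
    return grad

plain = input_gradient(0.5, weight=0.5, depth=30, residual=False)
skip = input_gradient(0.5, weight=0.5, depth=30, residual=True)
```

With these settings each plain-layer factor is at most 0.5, so the product over 30 layers is tiny, while each residual factor is at least 1 because of the identity term.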
Overlapping subproblems
Dynamic programming solves overlapping subproblems by storing results of subproblems to avoid redundant calculations
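A small sketch of the idea using Fibonacci numbers, a standard example of overlapping subproblems. The one-element `counter` lists are a hypothetical device added here to count how many subproblems are actually computed.

```python
def fib_naive(n, counter):
    """Plain recursion: the same subproblems are solved again and again."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_memo(n, memo, counter):
    """Dynamic programming: each subproblem is computed once, then looked up."""
    if n in memo:
        return memo[n]
    counter[0] += 1
    if n < 2:
        memo[n] = n
    else:
        memo[n] = fib_memo(n - 1, memo, counter) + fib_memo(n - 2, memo, counter)
    return memo[n]

naive_calls, memo_calls = [0], [0]
a = fib_naive(20, naive_calls)
b = fib_memo(20, {}, memo_calls)
```

Both functions return the same value, but the memoized version computes each of the subproblems 0 through 20 only once, while the naive version recomputes them many thousands of times.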
Weight initialization matters: Xavier/He init keeps activation variance roughly constant (≈ 1 for unit-variance inputs) across layers
Weight initialization stabilizes learning by maintaining consistent activation variance
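A rough numerical sketch of the variance argument, under simplifying assumptions: square linear layers with no nonlinearity, Gaussian weights, and a Xavier-style standard deviation of sqrt(1 / fan_in). For ReLU layers, He initialization uses sqrt(2 / fan_in) instead. The helper names `mean_square` and `propagate` are made up for this example.

```python
import random

def mean_square(values):
    """Average squared activation; equals the variance for zero-mean values."""
    return sum(v * v for v in values) / len(values)

def propagate(width, depth, weight_std, seed=0):
    """Push a random unit-variance input through `depth` random linear layers."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum of all inputs with fresh random weights.
        a = [sum(rng.gauss(0.0, weight_std) * x for x in a) for _ in range(width)]
    return mean_square(a)

width, depth = 64, 5
scaled = propagate(width, depth, weight_std=(1.0 / width) ** 0.5)  # Xavier-style
unscaled = propagate(width, depth, weight_std=1.0)                 # naive N(0, 1)
```

With the scaled weights the mean squared activation stays on the order of 1 after five layers. With unit-variance weights it grows by roughly a factor of `width` per layer.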
Masking (behavior)
Causal masking prevents attention to future tokens in the decoder
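A minimal sketch of causal masking on a small matrix of attention scores, written with plain Python lists. The function name `causal_attention_weights` is hypothetical, and real implementations apply the same mask to batched tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, so they contribute exp(-inf) = 0 after softmax.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.2, 1.0, -0.5],
    [0.7, 0.1, 0.9],
    [-0.3, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
```

The result is lower triangular: the first token attends only to itself, and every row still sums to 1 over the positions it is allowed to see.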
Flashbulb memory
Flashbulb memories are vivid but not always accurate