β₁ controls the exponential decay rate of the first moment estimates; β₂ controls the exponential decay rate of the second moment estimates in Adam optimizer

Image: Newspaper Enterprise Association, Public domain, via Wikimedia Commons

the β₁ and β₂ hyperparameters control in Adam

β₁ controls the exponential decay rate of the first moment estimates; β₂ controls the exponential decay rate of the second moment estimates in Adam optimizer

Ask Claude to explain

Related concepts

Adam has bias correction: divides by (1-β^t) in early steps

Adam bias correction divides by (1-β^t) in early steps to counteract initial bias from accumulated gradients

Adam combines momentum and RMSprop: adapts per-parameter learning rates

Adam combines momentum and RMSprop by adapting per-parameter learning rates

Adam vs SGD: Adam adapts per-parameter rates, SGD often generalizes better with tuning

Adam adjusts learning rates per-parameter, SGD generalizes better with tuning

AdaGrad's learning rate decays to zero

AdaGrad adjusts learning rate by accumulating squared gradients, causing it to decay to zero as denominator grows exponentially

gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation reduces memory usage by dividing a large batch into smaller mini-batches, accumulating gradients before updating model weights

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder

One email a day: 5 concepts + the 5 stories that matter →

Swipe through 100 ML concepts daily

Open TickerNews