Adam combines momentum with RMSprop-style scaling, giving each parameter its own adaptive learning rate
Adam vs SGD: Adam adapts per-parameter rates, SGD often generalizes better with tuning
Adam adapts the learning rate per parameter and usually needs less tuning; SGD, when carefully tuned, often generalizes better
Adam has bias correction: divides by (1-β^t) in early steps
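Adam's bias correction divides each moment estimate by (1-β^t). The moments start at zero, so early estimates are biased toward zero; the division undoes that bias, and its effect fades as t grows because (1-β^t) approaches 1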
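A minimal sketch of the bias correction described above, assuming a constant gradient of 1 so the true mean is known. The function name `ema_first_moment` and the value β = 0.9 are illustrative choices, not from the source.

```python
import numpy as np

def ema_first_moment(grads, beta=0.9):
    """Track an exponential moving average that starts at zero.

    Returns the raw estimates and the bias-corrected estimates
    (raw divided by 1 - beta**t) after each step t = 1, 2, ...
    """
    m = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta ** t))
    return np.array(raw), np.array(corrected)

# Constant gradient of 1: an unbiased estimate of the mean should be 1.
raw, corrected = ema_first_moment(np.ones(5))
print("raw      :", raw)
print("corrected:", corrected)
```

The raw estimate starts far below 1 and climbs slowly, while the corrected estimate stays at the true mean from the first step.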
What the β₁ and β₂ hyperparameters control in Adam
β₁ sets the exponential decay rate of the first moment estimate (the running mean of gradients); β₂ sets the exponential decay rate of the second moment estimate (the running mean of squared gradients)
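One way to see where β₁ and β₂ enter is a single Adam update written out in NumPy. This is a sketch rather than any library's implementation; `adam_step` is a hypothetical helper, and the defaults (lr = 1e-3, β₁ = 0.9, β₂ = 0.999, eps = 1e-8) are commonly used values assumed here.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment: mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment: mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude.
grad = np.array([100.0, 0.01])
p, m, v = adam_step(np.zeros(2), grad, np.zeros(2), np.zeros(2), t=1)
print(p)
```

On the first step both parameters move by roughly the learning rate despite the very different gradient scales, which is the per-parameter adaptation in action.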
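Why weight initialization matters: Xavier/He init keeps activation variance roughly constant (≈ 1 for unit-variance inputs) across layers
Good initialization stabilizes learning: with consistent activation variance, signals and gradients neither vanish nor explode as depth grows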
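The effect can be checked numerically. The sketch below pushes unit-variance inputs through a stack of ReLU layers and compares He initialization (weight std = sqrt(2 / fan_in)) with a naive fixed std of 0.01. The depth, width, batch size, and the helper name `second_moment_after` are assumptions made for illustration, and the quantity measured is the mean squared activation.

```python
import numpy as np

def second_moment_after(depth, width, weight_std_fn, rng):
    """Mean squared activation after `depth` ReLU layers of size `width`."""
    x = rng.standard_normal((256, width))  # batch of unit-variance inputs
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * weight_std_fn(width)
        x = np.maximum(x @ w, 0.0)  # linear layer followed by ReLU
    return float(np.mean(x ** 2))

rng = np.random.default_rng(0)
he_moment = second_moment_after(10, 512, lambda fan_in: np.sqrt(2.0 / fan_in), rng)
naive_moment = second_moment_after(10, 512, lambda fan_in: 0.01, rng)
print("He init   :", he_moment)
print("std = 0.01:", naive_moment)
```

With He init the signal stays on the order of 1 after ten layers; with the naive std it collapses toward zero, which is the vanishing-signal problem that careful initialization avoids.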
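MoE models have far more parameters than a dense model of similar per-token compute cost
MoE models spread their parameters across many experts, but a router activates only the top-k experts for each token, so per-token compute depends on k rather than on the total number of experts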
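A toy top-k routing layer makes the parameter-versus-compute gap concrete. This is a simplified sketch, assuming each expert is a single d×d linear map and the router is a linear layer followed by a softmax over the selected experts. The sizes (8 experts, k = 2, d = 16) and the name `moe_forward` are illustrative.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, top_k):
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router_w                              # (tokens, num_experts)
    top = np.argsort(logits, axis=1)[:, -top_k:]       # indices of the k best experts
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        sel = top[i]
        gate = np.exp(logits[i, sel] - logits[i, sel].max())
        gate /= gate.sum()                             # softmax over selected experts
        for g, e in zip(gate, sel):
            out[i] += g * (x[i] @ expert_ws[e])        # only k experts run per token
    return out, top

rng = np.random.default_rng(0)
num_experts, top_k, d, tokens = 8, 2, 16, 4
x = rng.standard_normal((tokens, d))
router_w = rng.standard_normal((d, num_experts))
expert_ws = rng.standard_normal((num_experts, d, d))

out, top = moe_forward(x, router_w, expert_ws, top_k)
total_expert_params = expert_ws.size        # parameters stored
active_expert_params = top_k * d * d        # parameters used per token
print(total_expert_params, active_expert_params)
```

Here the layer stores four times as many expert parameters as any single token touches; production systems add load balancing and batched dispatch, which this sketch leaves out.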
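What the LAMB optimizer does: layer-wise adaptive learning rates for large-batch training
LAMB computes an Adam-style update, then rescales it per layer by a trust ratio (the norm of the layer's weights over the norm of its update), which keeps training stable at very large batch sizes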
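The layer-wise scaling can be sketched for a single layer as follows. This is a simplified version under stated assumptions: it omits the norm-clipping function from the original LAMB formulation, `lamb_layer_step` is a hypothetical helper, and the hyperparameter defaults are illustrative.

```python
import numpy as np

def lamb_layer_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=0.01):
    """One simplified LAMB update for a single layer's weights."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam-style direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    # Trust ratio: scale the step relative to the size of this layer's weights.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update, m, v, trust

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
grad = 50.0 * rng.standard_normal((32, 32))  # deliberately large gradients
w_new, m, v, trust = lamb_layer_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1)
step_norm = np.linalg.norm(w_new - w)
print(trust, step_norm, 1e-3 * np.linalg.norm(w))
```

Because of the trust ratio, the step's norm equals lr times the norm of the layer's weights regardless of gradient scale, so each layer moves by a comparable relative amount.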