Adam combines momentum and RMSprop: adapts per-parameter learning rates

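Adam combines momentum (a running average of gradients) with RMSprop-style scaling (a running average of squared gradients), which gives each parameter its own effective learning rate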
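As a rough illustration of the description above, here is a minimal pure-Python sketch of one Adam update. The function name `adam_step`, the flat-list representation of parameters, and the default values (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are assumptions chosen for the example, not something this page specifies.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Momentum part: exponential moving average of the gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        # RMSprop part: exponential moving average of the squared gradient.
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: the gradient average divided by its own scale.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by a factor of 100.
params, m, v = adam_step([1.0, 1.0], [0.1, -10.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

On this first step both parameters move by roughly `lr` in magnitude even though their gradients differ by a factor of 100, which is the per-parameter adaptation in action.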


Related concepts

Adam vs SGD: Adam adapts per-parameter rates, SGD often generalizes better with tuning

Adam adjusts the learning rate for each parameter; SGD often generalizes better when it is carefully tuned

Adam has bias correction: divides by (1-β^t) in early steps

Adam's bias correction divides the moment estimates by (1-β^t), which counteracts their bias toward zero in early steps; the bias arises because the estimates are initialized at zero
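A small worked example may make the correction concrete. The sketch below tracks a constant gradient of 1.0 with a zero-initialized moving average; the helper name `ema_with_correction` and the chosen values are hypothetical, used only for illustration.

```python
def ema_with_correction(grad, beta, steps):
    """Track a constant gradient with a zero-initialized EMA.

    Returns a list of (raw, corrected) estimates, one pair per step.
    """
    m = 0.0
    history = []
    for t in range(1, steps + 1):
        m = beta * m + (1 - beta) * grad
        corrected = m / (1 - beta ** t)  # the (1 - beta^t) division
        history.append((m, corrected))
    return history

history = ema_with_correction(grad=1.0, beta=0.9, steps=3)
for raw, corrected in history:
    print(f"raw={raw:.3f} corrected={corrected:.3f}")
```

The raw estimate starts near 0.1 and only slowly approaches the true gradient, while the corrected estimate matches it from the first step. As t grows, β^t shrinks toward zero and the correction fades out.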

What the β₁ and β₂ hyperparameters control in Adam

β₁ controls the exponential decay rate of the first-moment estimate (the running average of gradients); β₂ controls the exponential decay rate of the second-moment estimate (the running average of squared gradients)
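One way to build intuition for these decay rates is the common rule of thumb that an exponential moving average with decay β averages over roughly 1/(1-β) recent steps. That rule of thumb, the helper name `ema_horizon`, and the values 0.9 and 0.999 (the commonly used Adam defaults) are assumptions brought in for this sketch rather than statements from this page.

```python
def ema_horizon(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)

def ema_weight(beta, steps_ago):
    """Weight an EMA gives to the value seen `steps_ago` steps in the past."""
    return (1.0 - beta) * beta ** steps_ago

first_moment_horizon = ema_horizon(0.9)     # about 10 steps
second_moment_horizon = ema_horizon(0.999)  # about 1000 steps
```

Under these assumed defaults, the gradient average reacts within about ten steps, while the squared-gradient average changes far more slowly, so the per-parameter scale stays comparatively stable.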

Why weight initialization matters: Xavier/He init keeps activation variance ≈ 1 across layers

Weight initialization stabilizes learning by maintaining consistent activation variance
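The variance argument can be checked with a little arithmetic. The sketch below uses the standard normal-distribution forms of Xavier (Glorot) and He (Kaiming) initialization; the function names and the layer width of 512 are hypothetical, and the variance formula assumes zero-mean weights that are independent of the inputs.

```python
import math

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal init: Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He/Kaiming normal init, intended for ReLU layers: Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def preactivation_variance(fan_in, weight_std, input_second_moment=1.0):
    """Variance of sum_i w_i * x_i for zero-mean weights independent of x."""
    return fan_in * weight_std ** 2 * input_second_moment

width = 512  # hypothetical layer width
xavier_var = preactivation_variance(width, xavier_std(width, width))
he_var = preactivation_variance(width, he_std(width))
```

With equal fan-in and fan-out, Xavier init gives a pre-activation variance of about 1. He init gives about 2, which compensates for ReLU zeroing out roughly half of the signal.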

MoE models have more parameters but similar compute cost

MoE models distribute parameters across many experts but route each token to only a few of them (top-k), so per-token compute stays close to that of a much smaller dense model
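To see the gap between stored parameters and per-token compute, here is a back-of-the-envelope sketch for a single MoE feed-forward layer. The function name, the two-matrix expert layout, the linear router, and the sizes used are all assumptions for illustration; real architectures differ in detail.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active-per-token) weight counts for one MoE FFN layer.

    Assumes each expert is two weight matrices (d_model x d_ff and
    d_ff x d_model), biases are ignored, and the router is a single
    d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

# Hypothetical sizes: 8 experts, 2 of them used per token.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(total, active, round(total / active, 2))
```

In this hypothetical configuration the layer stores about four times as many weights as any single token actually uses, which is why parameter count grows much faster than compute.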

What the LAMB optimizer does: layer-wise adaptive learning rates for large-batch training

The LAMB optimizer adjusts learning rates layer by layer, which helps keep training stable at large batch sizes
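The layer-wise adjustment can be sketched as a "trust ratio": the norm of a layer's weights divided by the norm of its proposed update. In the minimal sketch below, `update` is assumed to be an Adam-style direction (optionally with weight decay added) computed elsewhere; the function names and the fallback ratio of 1.0 for zero norms are illustrative assumptions.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in values))

def lamb_scale_layer(weights, update, lr):
    """Apply a LAMB-style step to one layer.

    The learning rate is multiplied by ||weights|| / ||update||, so each
    layer moves by an amount proportional to the size of its own weights.
    """
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    # Assumed fallback: use a ratio of 1.0 if either norm is zero.
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return [w - lr * trust_ratio * u for w, u in zip(weights, update)]

# Weights with norm 5 and an update with norm 1 give a trust ratio of 5.
new_weights = lamb_scale_layer([3.0, 4.0], [0.6, 0.8], lr=0.1)
```

Here the layer's effective learning rate is five times the base rate, and the weights move from [3.0, 4.0] to approximately [2.7, 3.6].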
