Adam adjusts learning rates per parameter; SGD often generalizes better with tuning
Image: Official GDC, CC BY 2.0, via Wikimedia Commons
Adam combines momentum and RMSprop: adapts per-parameter learning rates
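Adam combines momentum (a running average of gradients) with RMSprop-style scaling (a running average of squared gradients), so each parameter gets its own effective learning rate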
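A minimal sketch of one Adam update in plain Python can make the combination concrete. The function name adam_step and the list-of-floats representation are assumptions made for illustration; the default values shown are the commonly used ones, not something this card specifies.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to lists of floats; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # momentum-style first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g    # RMSprop-style second moment
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by four orders of magnitude.
params, m, v = adam_step([1.0, 1.0], [100.0, 0.01], [0.0, 0.0], [0.0, 0.0], t=1)
print(params)
```

Both parameters move by roughly lr on the first step even though their gradients differ greatly in size, which is the per-parameter adaptation the card describes.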
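Adam has bias correction: divides by (1-β^t), which matters most in early steps
Adam's bias correction divides each moment estimate by (1-β^t). The moving averages start at zero, so they are biased toward zero in early steps; the correction fades as t grows and β^t approaches 0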
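The effect is easiest to see on a constant gradient. The sketch below is illustrative only: the helper name ema_with_correction is hypothetical, and it tracks a single moving average rather than a full optimizer.

```python
def ema_with_correction(grads, beta):
    """Return (raw, corrected) moving averages per step, starting from zero."""
    m = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g            # zero-initialized running average
        out.append((m, m / (1 - beta ** t)))     # divide by (1 - beta^t)
    return out

# A constant gradient of 1.0: the raw average starts far below 1.0,
# while the corrected value matches the true gradient from the first step.
history = ema_with_correction([1.0] * 5, beta=0.9)
for step, (raw, corrected) in enumerate(history, start=1):
    print(step, round(raw, 4), round(corrected, 4))
```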
What the β₁ and β₂ hyperparameters control in Adam
β₁ is the exponential decay rate of the first-moment estimate (the running mean of gradients); β₂ is the exponential decay rate of the second-moment estimate (the running mean of squared gradients)
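One way to build intuition for the decay rates is the rough averaging horizon 1/(1-β) of an exponential moving average. The sketch below assumes the commonly used values β₁ = 0.9 and β₂ = 0.999 and ignores bias correction; the function names are made up for this example.

```python
def horizon(beta):
    """Rule-of-thumb number of recent steps an EMA with decay beta averages over."""
    return 1 / (1 - beta)

def recent_weight(beta, n_steps):
    """Fraction of the EMA's total weight carried by the latest n_steps inputs.

    The input k steps back has weight (1 - beta) * beta**k, so the latest
    n_steps inputs together carry 1 - beta**n_steps.
    """
    return 1 - beta ** n_steps

for name, beta in (("beta1", 0.9), ("beta2", 0.999)):
    n = round(horizon(beta))
    print(name, n, round(recent_weight(beta, n), 3))
```

Under these assumptions the first moment reacts over roughly the last 10 gradients, while the second moment smooths over roughly the last 1000.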
How batch size affects generalization: larger batches find sharper minima
Larger batches give more accurate, less noisy gradient estimates, but they tend to converge to sharper minima, which often generalize worse than the flatter minima reached with smaller, noisier batches
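The gradient-accuracy part of this card can be checked with a small simulation. This is a sketch on synthetic per-example gradients (Gaussian, an assumption made purely for illustration); it shows only how noise in the batch-averaged gradient shrinks with batch size and says nothing by itself about the sharpness of the minima reached.

```python
import random
import statistics

def minibatch_mean_std(per_example_grads, batch_size, trials=2000, seed=0):
    """Spread of the batch-averaged gradient across many random mini-batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

# Synthetic per-example gradients: mean 1.0, standard deviation 2.0.
data_rng = random.Random(42)
grads = [data_rng.gauss(1.0, 2.0) for _ in range(10_000)]

small = minibatch_mean_std(grads, batch_size=4)
large = minibatch_mean_std(grads, batch_size=256)
print(small, large)
```

The noise falls roughly with the square root of the batch size, so the batch of 256 yields a much tighter gradient estimate than the batch of 4.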
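What data augmentation does for generalization: artificially expands training set
Data augmentation artificially expands the training set with label-preserving transformations (flips, crops, shifts, added noise), which exposes the model to more variation and improves generalization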
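As a minimal sketch, the snippet below augments a tiny image stored as a nested list. The helpers hflip, shift_right, and augment are hypothetical names for this example; real pipelines typically rely on a library and a wider set of transformations.

```python
import random

def hflip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def shift_right(image, k, fill=0):
    """Shift each row k pixels to the right, padding on the left with fill."""
    return [[fill] * k + row[:len(row) - k] for row in image]

def augment(image, rng):
    """Return a randomly flipped and shifted copy; the label stays unchanged."""
    out = hflip(image) if rng.random() < 0.5 else [row[:] for row in image]
    return shift_right(out, rng.choice([0, 1]))

image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(4)]  # several views of one example
print(variants)
```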
MoE models have more parameters but similar compute cost
MoE models spread their parameters across many experts but route each token to only a few of them (the top k), so per-token compute stays close to that of a much smaller dense model
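A toy top-k router shows why compute tracks the number of active experts rather than the total. This is a sketch under simplifying assumptions: the experts are plain scalar functions, the gate scores are passed in directly instead of being computed by a learned gate, and all names are hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, shifted by the max for stability.
    exps = [math.exp(gate_logits[i] - gate_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the routed experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

calls = []  # records which experts actually ran

def make_expert(scale, name):
    def expert(x):
        calls.append(name)
        return scale * x
    return expert

experts = [make_expert(s, i) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
y = moe_forward(10.0, experts, gate_logits=[0.0, 5.0, 5.0, -1.0], k=2)
print(y, calls)
```

Four experts hold parameters here, but only the two selected by the router are evaluated for this input.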