Why SGD with momentum escapes local minima better than vanilla SGD

Momentum SGD accumulates a velocity from past gradients, which can carry the iterate through shallow local minima where vanilla SGD would stall
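To make the velocity idea concrete, here is a minimal sketch of the heavy-ball update on a one-dimensional toy loss. The loss `f(x) = x**4 - 2*x**2 + 0.3*x`, the function names, and the settings (`lr=0.01`, `beta=0.9`, start at `x0=2.0`) are all assumptions chosen for illustration, not part of the original card. Setting `beta=0.0` turns the same loop into vanilla gradient descent.

```python
def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Heavy-ball update: v <- beta * v + grad(x); x <- x - lr * v."""
    x, v = float(x0), 0.0
    path = [x]
    for _ in range(steps):
        v = beta * v + grad(x)  # velocity: decaying sum of past gradients
        x = x - lr * v          # step along the accumulated velocity
        path.append(x)
    return path


def tilted_double_well_grad(x):
    # Gradient of the hypothetical loss f(x) = x**4 - 2*x**2 + 0.3*x,
    # which has a shallower minimum near x = 1 and a deeper one near x = -1.
    return 4 * x**3 - 4 * x + 0.3


# Same start, same learning rate; only the momentum coefficient differs.
vanilla = sgd_momentum(tilted_double_well_grad, x0=2.0, beta=0.0)
momentum = sgd_momentum(tilted_double_well_grad, x0=2.0, beta=0.9)
print("vanilla ends at ", round(vanilla[-1], 3))
print("momentum ends at", round(momentum[-1], 3))
```

With these assumed settings the plain run should settle in the shallower basin it reaches first, while the momentum run should keep enough velocity to cross the barrier into the deeper one. Whether momentum escapes in general depends on the learning rate, `beta`, and the shape of the landscape, so this is a demonstration rather than a guarantee.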

Related concepts

Why proximal gradient descent is needed for L1 optimization

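Proximal gradient descent handles the non-differentiable L1 penalty by following each gradient step on the smooth loss with a soft-thresholding step, which sets small coefficients exactly to zero and so yields sparse solutions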
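As a rough sketch of that idea, the snippet below alternates a gradient step on a smooth loss with soft-thresholding, which is the proximal operator of the L1 penalty. The smooth loss `0.5 * ||w - target||^2`, the `target` values, and all function names are assumptions made up for this example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient(grad_smooth, w0, lam, step=0.5, steps=100):
    """Minimize smooth(w) + lam * ||w||_1 by gradient step, then prox step."""
    w = list(w0)
    for _ in range(steps):
        g = grad_smooth(w)
        # Gradient step on the smooth part, then soft-threshold each coordinate.
        w = [soft_threshold(wi - step * gi, step * lam) for wi, gi in zip(w, g)]
    return w


# Hypothetical smooth loss: 0.5 * ||w - target||^2, whose gradient is w - target.
target = [3.0, -0.2, 0.5, -2.0]


def grad_smooth(w):
    return [wi - ti for wi, ti in zip(w, target)]


w_sparse = proximal_gradient(grad_smooth, [0.0] * 4, lam=1.0)
print(w_sparse)
```

For this particular loss the minimizer is the soft-thresholded target, so coordinates whose target magnitude is below `lam` end up exactly zero instead of merely small.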

Why non-convex loss landscapes are hard: many local minima and saddle points

Non-convex landscapes have numerous local minima and saddle points, complicating optimization

Why L1 regularization produces sparse solutions: the corners of the L1 diamond lie on the axes

L1 regularization promotes sparsity by penalizing the absolute value of every coefficient, which drives some coefficients exactly to zero

How batch normalization helps train deep neural networks: it normalizes each layer's activations within a mini-batch to zero mean and unit variance

Batch normalization stabilizes and accelerates deep network training by normalizing each layer's activations per mini-batch, then applying a learned scale and shift
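The normalization step can be sketched in a few lines. This is a training-mode forward pass only, with no running statistics and no backward pass; the names `gamma`, `beta`, and `eps` follow common convention but are assumptions here, as is the toy batch.

```python
def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n  # biased (population) variance
        inv_std = (var + eps) ** -0.5                 # eps avoids division by zero
        for i in range(n):
            out[i][j] = gamma[j] * (col[i] - mean) * inv_std + beta[j]
    return out


# Hypothetical mini-batch: 3 examples, 2 features on very different scales.
activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(activations, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 3) for v in row])
```

With `gamma` at one and `beta` at zero, both columns come out with roughly zero mean and unit variance even though their raw scales differ by a factor of ten.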

What AdaGrad does: divides the learning rate by the square root of the sum of squared gradients

AdaGrad adapts the learning rate per parameter from its gradient history, shrinking the step for parameters that have accumulated large or frequent gradients
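A minimal sketch of that rule, assuming a plain list of parameters and a hypothetical quadratic loss whose two coordinates have very different gradient scales. The function name, `lr`, and `eps` values are illustrative choices rather than settings from the original card.

```python
import math


def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=100):
    """Per-coordinate step: lr * g / (sqrt(sum of squared past g) + eps)."""
    w = list(w0)
    accum = [0.0] * len(w)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(w)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            w[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return w, accum


# Hypothetical loss 0.5 * (100 * w0**2 + w1**2): gradients differ by 100x.
def grad(w):
    return [100.0 * w[0], w[1]]


w_one, accum_one = adagrad(grad, [1.0, 1.0], steps=1)
print(w_one)      # both coordinates move by about lr on the first step
print(accum_one)  # squared gradients accumulated so far
```

On the first step each coordinate moves by about `lr` regardless of its gradient magnitude, because the gradient is divided by its own absolute value. On later steps the accumulated sum keeps growing, so the effective learning rate for each coordinate can only shrink.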

Greedy vs beam search decoding: greedy picks best token, beam maintains k candidates

Greedy decoding commits to the single most probable token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them
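The difference shows up even on a tiny example. The sketch below uses a made-up lookup table, `TOY_MODEL`, in place of a real language model; its tokens and probabilities are assumptions picked so that the locally best first token leads to a worse full sequence.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy_decode(model, length):
    """Pick the single most probable next token at every step."""
    seq, logp = (), 0.0
    for _ in range(length):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp


def beam_search(model, length, k):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]


greedy_seq, greedy_logp = greedy_decode(TOY_MODEL, 2)
beam_seq, beam_logp = beam_search(TOY_MODEL, 2, k=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('A', 'x') 0.33
print(beam_seq, round(math.exp(beam_logp), 2))      # ('B', 'x') 0.36
```

Greedy commits to `A` because 0.6 beats 0.4, then can reach at most 0.33 overall. A beam of width 2 keeps `B` alive and finds the 0.36 sequence. With `k=1` the beam search reduces to greedy decoding.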
