Why proximal gradient descent is needed for L1 optimization

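The L1 penalty is not differentiable at zero, so plain gradient descent has no well-defined update there. Proximal gradient descent takes a gradient step on the smooth loss and then applies the proximal operator of the L1 term (soft-thresholding), which sets small coefficients exactly to zero and so yields sparse solutions.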
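A minimal sketch of this two-step update for the lasso objective 0.5 * ||Xw - y||^2 + lam * ||w||_1, assuming NumPy. The names soft_threshold and ista, the synthetic data, and the value of lam are illustrative choices, not taken from the card.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iters=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    # Step size 1/L, where L = sigma_max(X)^2 is the Lipschitz constant
    # of the gradient of the smooth least-squares term.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                          # gradient of the smooth term only
        w = soft_threshold(w - step * grad, step * lam)   # prox step handles the L1 term
    return w

# Hypothetical synthetic problem: only the first 3 of 10 coefficients are non-zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.01 * rng.normal(size=200)

w_hat = ista(X, y, lam=5.0)
```

In this setup the coefficients outside the true support should come out as exact zeros rather than merely small values, which is the practical difference from running subgradient descent on the same objective.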

Related concepts

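Why L1 regularization produces sparse solutions: the diamond's corners lie on the axes

The L1 constraint region is a diamond whose corners lie on the coordinate axes. The loss contours often first touch it at a corner, where one or more coefficients are exactly zero, whereas the round L2 ball has no corners and typically only shrinks coefficients without zeroing them.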
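A one-coefficient worked example makes the contrast concrete. The symbols z (the unpenalized solution) and λ (the penalty strength) are introduced here for illustration only.

```latex
\begin{aligned}
\text{L1:}\quad & \arg\min_w \tfrac{1}{2}(w - z)^2 + \lambda |w|
    = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0) \\
\text{L2:}\quad & \arg\min_w \tfrac{1}{2}(w - z)^2 + \tfrac{\lambda}{2} w^2
    = \frac{z}{1 + \lambda}
\end{aligned}
```

With z = 0.3 and λ = 0.5, the L1 solution is exactly 0, while the L2 solution is 0.2: small but never zero unless z itself is zero.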

Why non-convex loss landscapes are hard: many local minima and saddle points

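Non-convex landscapes can contain many local minima and saddle points where the gradient vanishes, so gradient-based methods may stall or converge to a suboptimal point depending on initialization.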

Why SGD with momentum escapes local minima better than vanilla SGD

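SGD with momentum keeps a velocity, an exponentially decaying sum of past gradients. The accumulated velocity can carry the iterate over small bumps and out of shallow local minima where vanilla SGD would stop.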
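A small sketch of this effect on a hypothetical 1-D toy loss, defined here through its slope only. It uses the full gradient with no minibatch noise, so it isolates the role of the velocity term. The landscape, the function names, and the hyperparameters are all assumptions made for illustration.

```python
def grad(x):
    """Slope of a toy loss with a shallow local minimum at x = 1 and a global minimum at x = 4."""
    if x < 1.0:
        return -1.0        # downhill toward the shallow minimum at x = 1
    if x < 1.5:
        return 0.5         # small uphill bump that traps plain gradient descent
    if x < 3.0:
        return -1.0        # downhill again past the bump
    return x - 4.0         # quadratic bowl around the global minimum at x = 4

def descend(beta, lr=0.1, n_steps=500, x0=0.0):
    """Gradient descent with heavy-ball momentum; beta = 0 gives the vanilla update."""
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v - lr * grad(x)   # velocity: decaying sum of past gradient steps
        x = x + v
    return x

x_vanilla = descend(beta=0.0)    # stays near the shallow minimum at x = 1
x_momentum = descend(beta=0.9)   # crosses the bump and settles near x = 4
```

The vanilla run keeps hovering around the kink at x = 1 because each step is too small to cross the bump, while the momentum run arrives at the bump with enough velocity to pass over it.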

What AdaGrad does: divides learning rate by sqrt of sum of squared gradients

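AdaGrad gives each parameter its own learning rate: the base rate divided by the square root of that parameter's accumulated squared gradients. Parameters tied to frequently active features take smaller and smaller steps, while rarely updated ones keep larger steps.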
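A minimal sketch of the per-parameter update, assuming NumPy. The function name adagrad_step, the eps constant, and the two-feature gradient sequence are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients per parameter."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)   # eps avoids division by zero
    return w, accum

# Feature 0 receives a gradient on every step; feature 1 only on the last step.
grads = [np.array([1.0, 0.0])] * 3 + [np.array([1.0, 1.0])]

w = np.zeros(2)
accum = np.zeros(2)
for g in grads:
    w_prev = w
    w, accum = adagrad_step(w, g, accum)

# Size of the final step for each parameter.
last_step = np.abs(w - w_prev)
```

On the final step both parameters see the same gradient, but the frequently updated one moves about half as far (roughly lr / sqrt(4) versus lr / sqrt(1)).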

What score matching does: learns the gradient of the log-density without normalizing

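Score matching fits a model to the score, the gradient of the log-density with respect to the data, by minimizing an objective that never requires the normalizing constant. This makes it well suited to unnormalized (energy-based) models.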
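A minimal 1-D sketch, assuming NumPy and a Gaussian-shaped model whose score is s(x) = -precision * (x - mu). It uses the integration-by-parts form of the objective, E[s'(x) + 0.5 * s(x)^2], which involves only the score and its derivative. The parameter names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def score_matching_loss(x, mu, precision):
    """Score matching objective for the model score s(x) = -precision * (x - mu)."""
    score = -precision * (x - mu)
    dscore_dx = -precision                 # derivative of the score with respect to x
    # No normalizing constant appears anywhere in this expression.
    return np.mean(dscore_dx + 0.5 * score ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=10_000)

# For this particular model the loss has a closed-form minimizer:
# the sample mean and the inverse of the (biased) sample variance.
mu_hat = x.mean()
precision_hat = 1.0 / x.var()
best_loss = score_matching_loss(x, mu_hat, precision_hat)
```

For richer models there is no closed form and the same loss would be minimized by gradient descent on the model parameters, but the objective keeps the same shape.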

How does the concept of convexity in optimization relate to finding the global minimum in a non-linear cost function?

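A convex function has no spurious local minima: every local minimum is a global minimum, so a descent method that reaches a stationary point has found the global optimum. If the function is strictly convex, the minimizer (when it exists) is also unique. Non-linear cost functions can still be convex; it is non-convexity, not non-linearity, that removes this guarantee.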
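For reference, the standard definition behind this property can be written as follows, with x and y ranging over the (convex) domain of f:

```latex
f\bigl(t x + (1 - t) y\bigr) \le t\, f(x) + (1 - t)\, f(y)
\qquad \text{for all } x, y \text{ and all } t \in [0, 1]
```

For example, f(x) = x^4 is non-linear yet satisfies this inequality, while f(x) = sin(x) does not and has many separate local minima.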
