Proximal gradient descent handles non-differentiable L1 regularization, enabling sparse solutions
Each iteration takes a gradient step on the smooth loss, then applies the L1 proximal operator (soft-thresholding), which shrinks every coefficient and sets the small ones exactly to zero
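A minimal sketch of proximal gradient descent on a lasso problem, assuming a least-squares loss scaled by 1/(2n). The function names, the toy data, and the values of `lam` and `step` are illustrative assumptions, not taken from the card.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t, clipping at zero."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def proximal_gradient_lasso(X, y, lam, step, n_iters=500):
    """Minimize (1/2n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # Residuals of the current linear fit.
        r = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        # Gradient of the smooth least-squares term only (the L1 term is excluded).
        g = [sum(X[i][j] * r[i] for i in range(n)) / n for j in range(d)]
        # Gradient step on the smooth part, then the L1 proximal step.
        w = [soft_threshold(w[j] - step * g[j], step * lam) for j in range(d)]
    return w


# Toy data (assumed): the target depends only on the first feature, y = 2 * x0.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.8], [4.0, -0.6]]
y = [2.0, 4.0, 6.0, 8.0]
w = proximal_gradient_lasso(X, y, lam=0.1, step=0.1)
```

On this toy data the weight of the irrelevant second feature ends at exactly 0.0 rather than a small non-zero value, while the first weight settles slightly below 2 because the penalty also shrinks it.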
Why L1 regularization produces sparse solutions — the diamond corners touch axes
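The L1 penalty pulls every coefficient toward zero with the same force regardless of its size, so weak coefficients land exactly on zero; geometrically, the loss contours tend to first meet the diamond-shaped L1 ball at a corner, where some coordinates are zero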
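The one-dimensional case makes the contrast with L2 concrete. A small sketch, assuming a quadratic loss 0.5 * (w - a)**2 around an unregularized coefficient `a`; the function names and sample values are illustrative.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def l2_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + 0.5 * lam * w**2: uniform rescaling."""
    return a / (1.0 + lam)


# Hypothetical unregularized coefficients, two of them weak.
unregularized = [3.0, -0.4, 0.2, -1.5]
lam = 0.5
l1_solution = [l1_shrink(a, lam) for a in unregularized]  # [2.5, 0.0, 0.0, -1.0]
l2_solution = [l2_shrink(a, lam) for a in unregularized]  # all entries stay non-zero
```

L1 zeroes every coefficient whose magnitude is at most `lam`, whereas L2 only rescales, so no coefficient reaches zero unless it started there.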
Why non-convex loss landscapes are hard: many local minima and saddle points
Non-convex landscapes have numerous local minima and saddle points, complicating optimization
Why SGD with momentum escapes local minima better than vanilla SGD
Momentum SGD accumulates velocity, helping to overcome shallow local minima
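A toy illustration of the velocity effect, using full (noise-free) gradients for simplicity. The loss, a wide quadratic bowl with a small dip near x = 2, and all hyperparameters are assumptions chosen for this sketch; whether momentum escapes in practice depends on the landscape and the settings.

```python
import math


def grad(x):
    """Gradient of f(x) = 0.05 * x**2 - 0.1 * exp(-12.5 * (x - 2)**2).

    The quadratic has its minimum at x = 0; the Gaussian dip adds a shallow
    local minimum just below x = 2.
    """
    return 0.1 * x + 2.5 * (x - 2.0) * math.exp(-12.5 * (x - 2.0) ** 2)


def descend(x, lr=0.1, beta=0.0, n_steps=500):
    """Gradient descent with heavy-ball momentum; beta = 0 gives the vanilla update."""
    v = 0.0
    for _ in range(n_steps):
        # Velocity is a decaying sum of past gradient steps.
        v = beta * v - lr * grad(x)
        x = x + v
    return x


x_vanilla = descend(4.0, beta=0.0)   # settles in the shallow dip near x = 1.9
x_momentum = descend(4.0, beta=0.9)  # carries through the dip and reaches x near 0
```

Starting from the same point, the vanilla update stops as soon as the gradient vanishes inside the dip, while the accumulated velocity carries the momentum run over the small barrier and into the main basin.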
What AdaGrad does: divides learning rate by sqrt of sum of squared gradients
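AdaGrad scales each parameter's learning rate by the inverse square root of its accumulated squared gradients, so frequently updated features take smaller steps while rarely updated ones keep larger steps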
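A minimal sketch of the per-parameter update, assuming plain Python lists for parameters and a small `eps` for numerical safety. The function name and the gradient stream are illustrative.

```python
import math


def adagrad_update(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one in-place AdaGrad step and return the per-parameter step sizes used."""
    step_sizes = []
    for i, g in enumerate(grads):
        # Running sum of squared gradients for this parameter.
        accum[i] += g * g
        # Effective learning rate: base rate divided by sqrt of that sum.
        step = lr / (math.sqrt(accum[i]) + eps)
        params[i] -= step * g
        step_sizes.append(step)
    return step_sizes


params = [0.0, 0.0]
accum = [0.0, 0.0]
# Hypothetical stream: feature 0 gets a gradient every step, feature 1 only on the last.
gradient_stream = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
for grads in gradient_stream:
    step_sizes = adagrad_update(params, grads, accum)
```

After the last step the frequently updated feature moves with an effective rate of about 0.05, half the rate of about 0.1 applied to the feature seeing its first gradient.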
What score matching does: learns the gradient of the log-density without normalizing
Score matching fits a model of the score, the gradient of the log-density with respect to the data, so the intractable normalizing constant never has to be computed
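A small sketch of the Hyvärinen score matching objective, the average of 0.5 * s(x)**2 + s'(x), for a one-dimensional Gaussian model. The model choice, function names, and data are assumptions for illustration; practical uses typically involve neural score models and variants such as denoising score matching.

```python
def score(x, mu, precision):
    """Model score d/dx log p(x) for a Gaussian; the normalizer drops out."""
    return -precision * (x - mu)


def score_derivative(x, mu, precision):
    """Derivative of the score with respect to x (constant for a Gaussian)."""
    return -precision


def score_matching_loss(data, mu, precision):
    """Empirical objective: mean of 0.5 * s(x)**2 + ds/dx over the data."""
    total = 0.0
    for x in data:
        s = score(x, mu, precision)
        total += 0.5 * s * s + score_derivative(x, mu, precision)
    return total / len(data)


data = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)

# For this model the objective is minimized at the sample mean and 1 / variance.
best = score_matching_loss(data, mean, 1.0 / var)
worse_mu = score_matching_loss(data, mean - 0.5, 1.0 / var)
worse_precision = score_matching_loss(data, mean, 1.0 / var + 0.1)
```

The loss uses only the score and its derivative, yet its minimizer recovers the sample mean and variance, which is the sense in which the density is learned without normalizing.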
How does the concept of convexity in optimization relate to finding the global minimum in a non-linear cost function?
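If the cost function is convex, every local minimum is also a global minimum, so descent methods cannot get trapped in a worse basin; strict convexity additionally makes that minimizer unique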