Natural gradient descent preconditions the gradient with the inverse of the Fisher information matrix, which acts as the metric on parameter space
Image: ChristianT, CC BY-SA 3.0, via Wikimedia Commons
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
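To make the gradient-plus-proximity-operator idea concrete, here is a minimal sketch for the lasso objective 0.5 * ||A x - b||^2 + lam * ||x||_1. The function names, the toy matrix, and the step size are illustrative assumptions, not taken from any particular library. For the L1 penalty, the proximity operator reduces to soft thresholding.

```python
def soft_threshold(v, t):
    """Proximity operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_lasso(A, b, lam, step, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 (hypothetical helper)."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iter):
        # Residual r = A x - b of the smooth least-squares term.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
             for i in range(n_rows)]
        # Gradient of the smooth term: A^T r.
        grad = [sum(A[i][j] * r[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step on the smooth part, then prox step on the L1 part.
        x = [soft_threshold(x[j] - step * grad[j], step * lam)
             for j in range(n_cols)]
    return x


# Toy problem with an identity design matrix (illustrative values only).
A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, 0.5]
x_hat = proximal_gradient_lasso(A, b, lam=1.0, step=1.0)
```

In this toy case the second coefficient is driven exactly to zero, which is the sparsity effect the L1 penalty is used for. The step size is assumed to be at most the reciprocal of the largest eigenvalue of A^T A.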
Fisher information
Fisher information measures how much information an observable random variable carries about an unknown parameter of the distribution that generates it
When to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize when zero mean and unit variance are required for gradient-based optimization
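As a small illustration of what standardizing a feature means, the sketch below subtracts the mean and divides by the population standard deviation. The helper name and the sample values are assumptions made for the example.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit variance (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature carries no scale; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 6.0, 8.0]  # illustrative values only
z = standardize(feature)
```

In practice the mean and standard deviation would usually be computed on the training data only and then reused for validation and test data.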
What score matching does: learns the gradient of the log-density without normalizing
Score matching learns the gradient of the log-density without computing the normalizing constant
Ordinary least squares
OLS minimizes the sum of squared differences between observed and predicted values
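For a single predictor, the least-squares line has a closed form, sketched below. The function name and the data points are assumptions chosen so the fit is exact.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (sketch)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]  # illustrative values only
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
slope, intercept = ols_fit(xs, ys)
```

The sketch assumes the x values are not all identical; otherwise the denominator is zero and the slope is undefined.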
AdaGrad's learning rate decays to zero
AdaGrad scales each parameter's learning rate by the inverse square root of its accumulated squared gradients, so the effective learning rate shrinks toward zero as the accumulated sum keeps growing
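The decay can be seen by tracking the effective step size for one parameter. The sketch below assumes a constant gradient of magnitude 1, so the accumulated sum grows linearly and the rate falls like 1/sqrt(t). The function name, base learning rate, and epsilon are illustrative assumptions.

```python
import math


def adagrad_effective_rates(grads, base_lr=0.1, eps=1e-8):
    """Return the per-step effective learning rate for one parameter (sketch)."""
    accum = 0.0
    rates = []
    for g in grads:
        # The accumulator only ever increases, since g * g >= 0.
        accum += g * g
        rates.append(base_lr / (math.sqrt(accum) + eps))
    return rates


# 100 steps with a constant gradient of 1.0 (illustrative values only).
rates = adagrad_effective_rates([1.0] * 100)
```

Because the accumulator never shrinks, the effective rate is non-increasing, which is the behavior that variants such as RMSProp and Adam address by using a decaying average instead of a running sum.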