Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
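As a minimal sketch of the idea above, the following NumPy code applies proximal gradient descent (often called ISTA) to an L1-regularized least-squares problem. The function names `soft_threshold` and `ista`, the synthetic data, and the choice `lam=0.1` are assumptions made for illustration; they do not come from the original card.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iters=500):
    """Minimize (1/2n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    # Step size 1/L, where L bounds the curvature of the smooth least-squares term.
    L = np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n                      # gradient step on the smooth part
        w = soft_threshold(w - step * grad, step * lam)   # prox step handles the L1 part
    return w

# Hypothetical synthetic data: only the first 3 of 20 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.01 * rng.normal(size=100)

w_hat = ista(X, y, lam=0.1)
print("nonzero coefficients:", np.count_nonzero(w_hat))
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the L1 penalty yields sparse solutions without ever differentiating the absolute value.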
Regularization (mathematics)
L1 regularization results in sparse solutions
batch size affects generalization: larger batches find sharper minima
Larger batch sizes tend to converge to sharper minima, which are associated with worse generalization; the noisier gradient estimates of smaller batches tend to favor flatter minima that generalize better
when to standardize: when you need zero mean and unit variance for gradient-based optimization
Standardize features to zero mean and unit variance before gradient-based optimization; features on comparable scales improve the conditioning of the problem and speed up convergence
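A short sketch of standardization in NumPy follows. The helper names `fit_standardizer` and `apply_standardizer` and the toy data are hypothetical; the guard for constant features is an assumption added to avoid division by zero.

```python
import numpy as np

def fit_standardizer(X):
    """Compute per-feature mean and standard deviation from the training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # leave constant features unscaled
    return mean, std

def apply_standardizer(X, mean, std):
    """Shift and scale features using statistics computed on the training data."""
    return (X - mean) / std

# Toy training data with two features on very different scales.
X_train = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 2000.0]])
mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)

# New data reuses the training statistics rather than recomputing them.
X_new = np.array([[2.5, 2500.0]])
Z_new = apply_standardizer(X_new, mean, std)
```

Reusing the training mean and standard deviation on new data keeps both sets in the same coordinate system.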
what natural gradient descent does: preconditions the gradient with the inverse Fisher matrix
Natural gradient descent multiplies the gradient by the inverse of the Fisher information matrix, which serves as the metric on parameter space
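To make the preconditioning concrete, here is a small sketch for logistic regression, where the Fisher information matrix has a closed form. The function names, the damping term, the step size `lr=0.5`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Mean negative log-likelihood of logistic regression."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def natural_gradient_step(w, X, y, lr=0.5, damping=1e-4):
    """One update w <- w - lr * F^{-1} grad, with F the Fisher information matrix."""
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)
    # For logistic regression, F = (1/n) * sum_i p_i (1 - p_i) x_i x_i^T.
    fisher = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
    fisher += damping * np.eye(len(w))  # small damping keeps the solve well-posed
    return w - lr * np.linalg.solve(fisher, grad)

# Hypothetical synthetic classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(3)
for _ in range(30):
    w = natural_gradient_step(w, X, y)
print("final loss:", nll(w, X, y))
```

In this particular model the Fisher matrix coincides with the Hessian of the loss, so the update behaves like a damped Newton step; that equivalence does not hold for models in general.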
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large batch into smaller micro-batches and accumulates their gradients before a single weight update, so peak memory scales with the micro-batch size rather than the effective batch size
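The following sketch illustrates why accumulation reproduces the large-batch gradient, using a linear model with mean squared error. The names `mse_grad` and `accumulated_grad` and the synthetic data are assumptions for illustration; in a deep learning framework the same idea is usually expressed by delaying the optimizer step.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of the mean squared error (1/n)||Xw - y||^2 with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def accumulated_grad(w, X, y, micro_batch_size):
    """Process the batch in small slices and accumulate one combined gradient."""
    n = len(y)
    total = np.zeros_like(w)
    for start in range(0, n, micro_batch_size):
        Xb = X[start:start + micro_batch_size]
        yb = y[start:start + micro_batch_size]
        # Weight each slice by its share of the batch so the sum equals the full-batch mean.
        total += mse_grad(w, Xb, yb) * (len(yb) / n)
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)

g_full = mse_grad(w, X, y)
g_accum = accumulated_grad(w, X, y, micro_batch_size=8)
print(np.allclose(g_full, g_accum))
```

Only one micro-batch needs to be held in memory at a time, while the single weight update that follows uses the gradient of the full batch.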
Convex optimization
Every local minimum of a convex function is a global minimum; a strictly convex function has at most one minimizer
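A convex function has a single minimum value, but that value can be attained at one point or on a whole interval. The sketch below contrasts two hypothetical examples chosen for illustration: `flat_bottom` is convex with many minimizers, while `parabola` is strictly convex with exactly one.

```python
import numpy as np

def flat_bottom(x):
    """Convex but not strictly convex: zero on the whole interval [-1, 1]."""
    return np.maximum(np.abs(x) - 1.0, 0.0)

def parabola(x):
    """Strictly convex: the minimum is attained only at x = 0."""
    return x ** 2

xs = np.linspace(-3.0, 3.0, 601)
flat_minimizers = xs[flat_bottom(xs) == flat_bottom(xs).min()]
parabola_minimizers = xs[parabola(xs) == parabola(xs).min()]

# Midpoint convexity check on random pairs: f((a + b) / 2) <= (f(a) + f(b)) / 2.
rng = np.random.default_rng(0)
a, b = rng.uniform(-3, 3, size=1000), rng.uniform(-3, 3, size=1000)
midpoint_ok = np.all(
    flat_bottom((a + b) / 2) <= (flat_bottom(a) + flat_bottom(b)) / 2 + 1e-12
)
print(len(flat_minimizers), len(parabola_minimizers), midpoint_ok)
```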