
Dropout randomly deactivates neurons during training, simulating an ensemble of subnetworks, thus preventing co-adaptation and improving generalization
Image: Guss, CC BY-SA 4.0, via Wikimedia Commons
Dropout (neural networks)
Dropout randomly sets a fraction of neuron inputs/outputs to zero during training; at inference time all units are kept
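A minimal sketch of the masking step, assuming the common "inverted dropout" variant in which surviving activations are rescaled by 1 / (1 - p_drop) so the expected activation is unchanged. The function name `dropout` and its parameters are illustrative, not taken from any library.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout on a flat list of activations (illustrative helper)."""
    if not training or p_drop == 0.0:
        # At inference time every unit is kept and nothing is rescaled.
        return list(activations)
    keep = 1.0 - p_drop
    # Keep each unit with probability `keep`; rescale survivors by 1 / keep
    # so the expected value of each activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10000
train_out = dropout(x, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(x, p_drop=0.5, training=False, rng=rng)

zero_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
```

With p_drop = 0.5, roughly half of the training-mode outputs are zero and the rest are doubled, so the mean stays near the original value.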
batch size affects generalization: larger batches tend to find sharper minima
Larger batch sizes give more accurate gradient estimates, but they tend to converge to sharper minima, which often generalize worse; the gradient noise of smaller batches tends to favor flatter minima
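The gradient-accuracy part of this card can be checked numerically. The sketch below is a toy setup, not a training run: `per_example_grads` is a hypothetical pool of per-example gradients for a single scalar parameter, and a mini-batch gradient is simply the mean of a random sample from it. It only illustrates that the estimate gets less noisy as the batch grows; it says nothing about the sharpness of minima.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical per-example gradients for one scalar parameter.
per_example_grads = [rng.gauss(1.0, 2.0) for _ in range(4096)]

def minibatch_grad_std(batch_size, trials=1000):
    """Spread of the mini-batch gradient estimate across random batches."""
    estimates = []
    for _ in range(trials):
        batch = rng.sample(per_example_grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)

std_small = minibatch_grad_std(4)    # noisy estimate
std_large = minibatch_grad_std(64)   # much less noisy estimate
```

The standard deviation of the estimate shrinks roughly with the square root of the batch size, so the 16x larger batch should be about 4x less noisy.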
Vanishing gradient problem
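Residual connections help by letting gradients flow through the skip connection: for y = x + F(x), the derivative dy/dx = 1 + F'(x) keeps an identity term even when F'(x) is small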
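A small scalar sketch of this effect, assuming a toy "network" in which every layer is tanh(w * h) with the same weight w. The function `chain_gradient` is illustrative; it applies the chain rule by hand to track the derivative of the final output with respect to the input.

```python
import math

def chain_gradient(x, w, depth, residual):
    """d(output)/d(input) for a stack of scalar tanh layers (toy example)."""
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)  # derivative of tanh(w * h) w.r.t. h
        if residual:
            # h_next = h + tanh(w * h): the skip path adds 1 to the local derivative
            h, grad = h + t, grad * (1.0 + local)
        else:
            # h_next = tanh(w * h): each layer multiplies the gradient by `local`
            h, grad = t, grad * local
    return grad

plain_grad = chain_gradient(x=1.0, w=0.5, depth=30, residual=False)
residual_grad = chain_gradient(x=1.0, w=0.5, depth=30, residual=True)
```

In the plain stack every factor is at most 0.5, so the product collapses toward zero over 30 layers; in the residual stack every factor is at least 1, so the gradient cannot vanish.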
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation reduces peak memory usage by splitting a large batch into smaller micro-batches, processing them one at a time, and accumulating their gradients before a single update of the model weights
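A minimal sketch on a one-parameter linear model with a mean-squared-error loss, assuming equally sized micro-batches. The data, `mse_grad`, and the learning rate are made up for illustration. Each micro-batch gradient is divided by the number of accumulation steps so that the sum matches the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Toy data for illustration only.
xs = [0.5 * i for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.0

# Reference: gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

# Accumulation: only `micro_size` examples are processed at a time.
micro_size = 2
num_steps = len(xs) // micro_size
accumulated = 0.0
for k in range(num_steps):
    lo, hi = k * micro_size, (k + 1) * micro_size
    # Scale by 1 / num_steps so the sum equals the full-batch mean gradient.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps

# One weight update after all micro-batches have been seen.
lr = 0.01
w_new = w - lr * accumulated
```

The equivalence is exact here because the loss is a plain mean over examples; layers that compute per-batch statistics, such as batch normalization, would see the micro-batch rather than the full batch.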
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
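A minimal sketch of proximal gradient descent (ISTA) on a toy separable problem, minimizing 0.5 * (w_i - b_i)^2 + lam * |w_i| for each coordinate. For the L1 penalty the proximity operator is soft-thresholding. The values of `b`, `lam`, and `step` are arbitrary choices for illustration.

```python
import math

def soft_threshold(v, thr):
    """Proximity operator of thr * |.|: shrink v toward zero by thr."""
    return math.copysign(max(abs(v) - thr, 0.0), v)

b = [3.0, 0.2, -1.5]   # toy targets
lam = 0.5              # L1 regularization strength
step = 0.5             # gradient step size

w = [0.0, 0.0, 0.0]
for _ in range(100):
    # Gradient step on the smooth part 0.5 * (w - b)^2, whose gradient is (w - b) ...
    stepped = [wi - step * (wi - bi) for wi, bi in zip(w, b)]
    # ... followed by the proximal step for the non-differentiable L1 part.
    w = [soft_threshold(v, step * lam) for v in stepped]
```

For this problem the minimizer is known in closed form as soft_threshold(b_i, lam), so the iterates should approach [2.5, 0.0, -1.0]; the small middle coefficient is set exactly to zero, which is the sparsity effect L1 regularization is used for.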
weight initialization matters: Xavier/He init keeps activation variance ≈ 1 across layers
Proper weight initialization stabilizes learning by keeping activation variance consistent from layer to layer, so signals neither vanish nor explode with depth
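A small pure-Python sketch comparing He initialization (weight variance 2 / fan_in, intended for ReLU layers) with a naive fixed standard deviation of 0.01. The network shape, the seed, and the helper `final_second_moment` are arbitrary illustrative choices. Xavier initialization follows the same idea with variance 2 / (fan_in + fan_out) and is not shown here.

```python
import math
import random

def final_second_moment(weight_std, width=256, depth=4, batch=8, seed=0):
    """Mean squared activation after `depth` fully connected ReLU layers."""
    rng = random.Random(seed)
    # Unit-variance Gaussian inputs.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        # Dense layer followed by ReLU, applied to each sample in the batch.
        acts = [[max(0.0, sum(w * a for w, a in zip(row, sample)))
                 for row in weights]
                for sample in acts]
    total = sum(a * a for sample in acts for a in sample)
    return total / (batch * width)

width = 256
he_moment = final_second_moment(math.sqrt(2.0 / width), width=width)
naive_moment = final_second_moment(0.01, width=width)
```

Under He initialization the mean squared activation should stay on the order of 1 after four layers, while the naive initialization shrinks it by orders of magnitude at every layer.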