Cosine annealing decays the learning rate from a maximum to a minimum value along a half-cosine curve; with warm restarts, the schedule repeats cyclically


Cosine annealing does: lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(π * t / T)), where t is the current step and T is the total number of steps
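A minimal sketch of the formula above in plain Python. The function name `cosine_annealing_lr` and the example values (10 steps, lr_max = 0.1, lr_min = 0.001) are illustrative assumptions, not part of the original card.

```python
import math

def cosine_annealing_lr(t, T, lr_max, lr_min=0.0):
    """Learning rate at step t (0 <= t <= T) under a single cosine annealing cycle."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# Hypothetical schedule: 10 steps from lr_max down to lr_min.
schedule = [cosine_annealing_lr(t, 10, lr_max=0.1, lr_min=0.001) for t in range(11)]
```

The schedule starts at lr_max when t = 0, passes through the midpoint of lr_max and lr_min at t = T / 2, and ends at lr_min when t = T.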



Learning rate cosine annealing formula (with a minimum learning rate of zero): learning_rate = learning_rate_initial * 0.5 * (1 + cos(pi * epoch / total_epochs))

AdaGrad's learning rate decays to zero

AdaGrad adjusts the learning rate by accumulating squared gradients, so the effective learning rate decays toward zero as the denominator grows monotonically

AdaGrad does: divides learning rate by sqrt of sum of squared gradients

AdaGrad adjusts learning rate by dividing it by the square root of the sum of squared gradients
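The shrinking effective learning rate can be seen in a small scalar sketch. The helper `adagrad_step`, the toy objective f(w) = w^2, and the constants `lr` and `eps` are assumptions chosen for illustration; a real implementation would keep one accumulator per parameter.

```python
import math

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update for a single scalar parameter."""
    accum = accum + grad * grad                   # running sum of squared gradients
    w = w - lr * grad / (math.sqrt(accum) + eps)  # base rate divided by sqrt of that sum
    return w, accum

w, accum = 5.0, 0.0
effective_lrs = []
for _ in range(50):
    grad = 2 * w  # gradient of the toy objective f(w) = w^2
    w, accum = adagrad_step(w, grad, accum)
    effective_lrs.append(0.1 / (math.sqrt(accum) + 1e-8))
```

Because the accumulator never decreases, each entry in `effective_lrs` is smaller than the one before it, which is the decay described above.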

Write the Bellman equation for reinforcement learning

Bellman optimality equation: V*(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s')]
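One way to use this equation is value iteration: apply the right-hand side repeatedly until the values stop changing. The sketch below assumes a tiny hypothetical two-state MDP; the function name `value_iteration` and the `P` / `R` data layout are illustrative choices, not a standard API.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward."""
    V = [0.0] * len(P)
    for _ in range(max_iters):
        # Bellman backup: best action value under the current estimate V.
        V_new = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical MDP: in each state, action 0 stays put and action 1 switches state.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
V = value_iteration(P, R, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 2 / (1 - 0.9) = 20, and moving there from state 0 is worth 1 + 0.9 * 20 = 19.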

Learning rate warmup does: starts with a small learning rate to avoid instability early in training

Learning rate warmup gradually increases the learning rate from zero (or a small value) to a predefined target to stabilize the early phase of training
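A minimal sketch of linear warmup, one common variant. The function name `warmup_lr` and its parameters are assumptions for illustration; other ramp shapes (for example exponential) are also used in practice.

```python
def warmup_lr(step, warmup_steps, target_lr):
    """Linearly ramp from 0 to target_lr over warmup_steps, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps

# Hypothetical usage: ramp to 1e-3 over the first 100 steps.
lrs = [warmup_lr(s, warmup_steps=100, target_lr=1e-3) for s in range(200)]
```

After the warmup phase, the constant `target_lr` would typically be replaced by a decay schedule such as cosine annealing.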

Adam optimizer weight update with m and v terms

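Adam optimizer weight update: w_t = w_{t-1} - α * m̂_t / (sqrt(v̂_t) + ε), where m̂_t = m_t / (1 - β1^t) and v̂_t = v_t / (1 - β2^t) are the bias-corrected first and second moment estimates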
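A scalar sketch of one Adam step, including the moment updates and the commonly used bias correction. The helper `adam_step`, the toy objective f(w) = w^2, and the hyperparameter values are illustrative assumptions; libraries apply the same update element-wise to tensors.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * w  # gradient of the toy objective f(w) = w^2
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
```

On the very first step the bias correction makes m̂ equal to the gradient and v̂ equal to its square, so the update has magnitude close to α regardless of the gradient's scale.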
