The reverse process learns p_θ(x_{t-1}|x_t): denoising one step at a time
Image: XiaYZ2023, CC BY-SA 4.0, via Wikimedia Commons
AdaGrad's learning rate decays to zero
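AdaGrad divides the learning rate by the square root of the accumulated squared gradients. Because that sum never decreases, the effective learning rate shrinks monotonically and can approach zero over a long training run. The denominator grows with the sum of squared gradients, not exponentially.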
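The decay can be seen in a small numerical sketch. The function name `adagrad_effective_lr`, the base rate of 0.1, and the constant gradient of 1 are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def adagrad_effective_lr(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate lr / (sqrt(G_t) + eps) at each step.

    G_t is the running sum of squared gradients (hypothetical scalar setup).
    """
    accum = 0.0
    rates = []
    for g in grads:
        accum += g ** 2  # the accumulator only ever grows
        rates.append(lr / (np.sqrt(accum) + eps))
    return np.array(rates)

# Assumed toy input: a constant gradient of 1 for 10,000 steps,
# so G_t = t and the effective rate is roughly 0.1 / sqrt(t).
rates = adagrad_effective_lr(np.ones(10_000))
print(rates[0], rates[-1])  # roughly 0.1 at the start, roughly 0.001 at the end
```

With a constant gradient the rate falls like 1/sqrt(t): a steady polynomial decay that keeps shrinking for as long as training continues.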
Pre-LN transformers are easier to train
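Pre-LN transformers apply layer normalization inside each residual branch, before the attention or feed-forward sublayer, so the residual path itself stays an unnormalized identity connection. Post-LN transformers also have residual connections, but they normalize after the addition. The identity path in Pre-LN lets gradients flow more smoothly during backpropagation.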
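The difference between the two layouts comes down to where the normalization sits relative to the residual addition. The sketch below is a minimal NumPy illustration; `layer_norm`, `post_ln_block`, and `pre_ln_block` are hypothetical helper names, and the learnable scale and shift of a real layer norm are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return x + sublayer(layer_norm(x))

# Assumed toy sublayer that outputs zeros, to isolate the residual path.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 8))
zero_sublayer = lambda h: np.zeros_like(h)

pre_out = pre_ln_block(x, zero_sublayer)    # equals x: a pure identity path
post_out = post_ln_block(x, zero_sublayer)  # x is rescaled by the layer norm
```

With the sublayer silenced, the Pre-LN block passes its input through unchanged, while the Post-LN block still transforms it. That untouched identity path is what the gradient travels along in a deep Pre-LN stack.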
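Denoising score matching: learning to denoise is equivalent to learning the score
Denoising score matching trains a model on noise-corrupted samples to point back toward the clean data. The optimal solution gives the score (the gradient of the log probability density) of the noise-perturbed data distribution.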
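A one-dimensional example makes the link between denoising and the score concrete. The sketch assumes Gaussian data with standard deviation `data_std`, Gaussian corruption with standard deviation `sigma`, and a linear score model s(x) = a * x; all of these choices and variable names are illustrative assumptions. For this setup the true score of the noisy distribution is -x / (data_std^2 + sigma^2), so the fitted slope can be checked against a known value.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std, sigma, n = 2.0, 0.5, 200_000  # assumed toy settings

x = rng.normal(0.0, data_std, size=n)   # clean data
eps = rng.normal(size=n)                # standard Gaussian noise
x_noisy = x + sigma * eps               # corrupted data

# Denoising score matching target: the score of q(x_noisy | x),
# which is -(x_noisy - x) / sigma^2 = -eps / sigma.
target = -eps / sigma

# Fit the linear score model s(x) = a * x by least squares (closed form).
a_hat = np.sum(x_noisy * target) / np.sum(x_noisy ** 2)

# True score slope of the noisy marginal N(0, data_std^2 + sigma^2).
a_true = -1.0 / (data_std ** 2 + sigma ** 2)

# The learned score also yields a denoiser: x_noisy + sigma^2 * score(x_noisy).
denoised = x_noisy + sigma ** 2 * a_hat * x_noisy

print(a_hat, a_true)
```

Regressing onto the scaled negative noise recovers the slope of the true score, and plugging that score back in moves each noisy sample toward the clean data.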
Adam has bias correction: divides by (1-β^t) in early steps
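Adam's bias correction divides the moment estimates by (1-β^t). The moving averages are initialized at zero, so they are biased toward zero in early steps. The correction matters most then, because the factor (1-β^t) approaches 1 as t grows.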
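The effect of the correction is easy to check on the first-moment estimate alone. This is a minimal sketch with a hypothetical helper `adam_first_moment` and an assumed constant gradient of 1; the same correction applies to the second moment with its own β.

```python
def adam_first_moment(grads, beta1=0.9):
    """Return raw and bias-corrected first-moment estimates at each step."""
    m = 0.0  # moving average starts at zero, which biases early estimates
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        raw.append(m)
        corrected.append(m / (1 - beta1 ** t))  # bias correction
    return raw, corrected

# Assumed toy input: the gradient is exactly 1 at every step.
raw, corrected = adam_first_moment([1.0] * 5)
print(raw)        # starts near 0.1 and climbs slowly toward 1
print(corrected)  # stays at approximately 1 from the first step
```

Without the correction the first estimate is only a tenth of the true gradient. Dividing by (1-β^t) restores the right scale immediately.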
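Langevin dynamics: adding noise to gradient descent to sample from a distribution
Langevin dynamics adds Gaussian noise to each gradient descent step on the negative log density, so the iterates sample from the target distribution instead of converging to a single mode.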
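A minimal sketch of the discretized (unadjusted) Langevin update, x ← x + (step/2) · score(x) + sqrt(step) · noise, is shown here. The function name `langevin_sample`, the step size, and the Gaussian target N(3, 1) are illustrative assumptions; the target was picked because its score, -(x - 3), is known exactly.

```python
import numpy as np

def langevin_sample(score, x0, step=0.01, n_steps=2000, rng=None):
    """Run unadjusted Langevin dynamics from x0 and return the final state."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        # Gradient step on the log density plus injected Gaussian noise.
        x = x + 0.5 * step * score(x) + np.sqrt(step) * noise
    return x

# Assumed target: N(3, 1), whose score is -(x - 3).
rng = np.random.default_rng(0)
samples = langevin_sample(lambda x: -(x - 3.0), x0=np.zeros(5000), rng=rng)
print(samples.mean(), samples.std())  # should be close to 3 and 1
```

Each of the 5000 entries is an independent chain started at zero. Dropping the noise term would collapse every chain onto the mode at 3; keeping it spreads the chains out to match the target variance, up to a small discretization error.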
Diffusion model
The forward process q(x_t|x_{t-1}) adds Gaussian noise at each step
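One common way to write this step, assumed here because the text above does not give a formula, is the variance-preserving form x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε with ε drawn from a standard Gaussian. The linear schedule of `betas` and the helper name `forward_diffusion` are likewise illustrative choices.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply the noising step once per beta and return every intermediate state."""
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for beta in betas:
        noise = rng.normal(size=x.shape)
        # Shrink the signal slightly and add a matching amount of Gaussian noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x.copy())
    return trajectory

rng = np.random.default_rng(0)
x0 = np.full(2000, 5.0)                  # assumed toy data: every point at 5
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
trajectory = forward_diffusion(x0, betas, rng)

final = trajectory[-1]
print(final.mean(), final.std())  # should be close to 0 and 1
```

After enough steps the original value of 5 is almost entirely washed out and the samples are close to a standard Gaussian, which is the starting point for running the process in reverse.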