the reverse process learns: p_θ(x_{t-1}|x_t)

The reverse process learns p_θ(x_{t-1}|x_t), a model of the denoising transition that removes noise one step at a time
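A minimal sketch of one reverse step, assuming a DDPM-style parameterization in which the model predicts the noise in x_t. The card does not specify this parameterization. The names `reverse_step`, `oracle_noise`, `DATA_VALUE`, and the beta schedule are hypothetical. To keep the example self-contained, the learned network is replaced by an exact "oracle" predictor for toy data where every data point equals the same constant.

```python
import math
import random

def reverse_step(x_t, t, betas, alpha_bars, predict_noise, rng):
    """One ancestral sampling step x_t -> x_{t-1} (DDPM-style, assumed)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps_hat = predict_noise(x_t, t)  # stands in for the learned network
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma = math.sqrt(beta_t)  # one common choice of reverse variance
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

# Hypothetical linear noise schedule and its cumulative products.
T = 100
betas = [1e-4 + (0.05 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

DATA_VALUE = 3.0  # toy data: every data point equals this constant

def oracle_noise(x_t, t):
    """Exact noise for the toy data; a trained model would approximate this."""
    a_bar = alpha_bars[t]
    return [(x - math.sqrt(a_bar) * DATA_VALUE) / math.sqrt(1.0 - a_bar) for x in x_t]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5)]  # start from pure noise
for t in reversed(range(T)):
    x = reverse_step(x, t, betas, alpha_bars, oracle_noise, rng)
print(x)
```

With the oracle predictor, the chain starts from pure noise and ends at the toy data value, which is the behavior a trained model is meant to approximate on real data.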

Related concepts

AdaGrad's learning rate decays to zero

AdaGrad divides the learning rate by the square root of the accumulated squared gradients. Because that sum never decreases, the effective learning rate shrinks monotonically toward zero
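A small sketch of that decay for a single scalar parameter. The function name `adagrad_effective_lrs`, the base learning rate, and the constant gradient stream are illustrative assumptions.

```python
import math

def adagrad_effective_lrs(grads, base_lr=0.1, eps=1e-8):
    """Return base_lr / (sqrt(sum of squared gradients) + eps) after each step."""
    accum = 0.0  # running sum of squared gradients; it never decreases
    lrs = []
    for g in grads:
        accum += g * g
        lrs.append(base_lr / (math.sqrt(accum) + eps))
    return lrs

# Hypothetical gradient stream: a constant gradient of 1.0 for 100 steps.
lrs = adagrad_effective_lrs([1.0] * 100)
print(lrs[0], lrs[-1])
```

With a constant gradient the accumulator grows linearly in the step count t, so the effective step size falls off like 1/sqrt(t).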

Pre-LN transformers are easier to train

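Pre-LN transformers apply layer normalization inside each residual branch, x + F(LN(x)), rather than after the addition, LN(x + F(x)), as Post-LN does. The residual path stays an identity, allowing gradients to flow more smoothly during backpropagation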

denoising score matching does: learns to denoise, which equals learning the score

Denoising score matching trains a model to denoise noise-corrupted data; the optimal denoiser determines the score (the gradient of the log probability density) of the noised data distribution
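The link between denoising and the score can be written out under one common assumption, Gaussian corruption with a fixed noise level σ. The symbols s_θ (the score model) and q_σ (the noised distribution) are notation introduced here, not taken from the card.

```latex
% Assumption: \tilde{x} = x + \sigma\epsilon, with \epsilon \sim \mathcal{N}(0, I)
\begin{align}
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)
  &= -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}, \\
\mathcal{L}_{\mathrm{DSM}}(\theta)
  &= \mathbb{E}_{x,\,\tilde{x}}\left[ \left\| s_\theta(\tilde{x})
     + \frac{\tilde{x} - x}{\sigma^2} \right\|^2 \right], \\
s^{*}(\tilde{x})
  &= \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})
   = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2}.
\end{align}
```

The last line says the minimizer of the objective points from the noisy input toward the expected clean input E[x | x̃], scaled by 1/σ². Predicting the clean sample and predicting the score are therefore the same task up to that rescaling.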

Adam has bias correction: divides by (1-β^t) in early steps

Adam's bias correction divides the moment estimates by (1-β^t). The moving averages are initialized at zero and so are biased toward zero in early steps; the correction fades as t grows
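A minimal sketch of the correction for the first-moment estimate of a single scalar. The function name `adam_first_moment` and the constant gradient stream are illustrative assumptions; beta1 = 0.9 is the commonly used default.

```python
def adam_first_moment(grads, beta1=0.9):
    """Return (raw, corrected) first-moment estimates after each step."""
    m = 0.0  # starts at zero, which pulls early estimates toward zero
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta1 ** t))  # bias correction
    return raw, corrected

# Hypothetical gradient stream: a constant gradient of 1.0.
raw, corrected = adam_first_moment([1.0] * 5)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(t, r, c)
```

For a constant gradient the raw estimate equals 1 - β^t times the true value, so dividing by 1 - β^t recovers the gradient from the first step. The second-moment estimate is corrected the same way with its own β.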

Langevin dynamics does: adds noise to gradient descent to sample from a distribution

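Langevin dynamics adds Gaussian noise to each gradient step along the score (the gradient of the log density), so the iterates sample from the distribution instead of converging to a mode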
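A rough sketch of unadjusted Langevin sampling in one dimension, assuming the target is a Gaussian whose score is known in closed form. The names `langevin_samples` and `gaussian_score`, the step size, and the target parameters are all hypothetical.

```python
import math
import random

def langevin_samples(score, x0, step=0.01, n_steps=50000, burn_in=5000, seed=0):
    """Iterate x <- x + (step / 2) * score(x) + sqrt(step) * z, with z ~ N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for i in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * score(x) + math.sqrt(step) * noise
        if i >= burn_in:  # discard early iterates that still depend on x0
            samples.append(x)
    return samples

def gaussian_score(x, mean=2.0, std=0.5):
    """Score (gradient of the log density) of N(mean, std^2)."""
    return -(x - mean) / (std * std)

samples = langevin_samples(gaussian_score, x0=0.0)
mean_est = sum(samples) / len(samples)
var_est = sum((s - mean_est) ** 2 for s in samples) / len(samples)
print(mean_est, var_est)
```

Without the noise term the update is plain gradient ascent on the log density and would settle at the mode, x = 2. With the noise, the iterates spread out to roughly match the target's mean and variance, up to a small discretization bias from the finite step size.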

Diffusion model

The forward process q(x_t|x_{t-1}) adds Gaussian noise at each step
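A small sketch of the forward noising chain, assuming the variance-preserving form q(x_t|x_{t-1}) = N(sqrt(1-β_t) x_{t-1}, β_t I) used in DDPM-style models. The card only says that Gaussian noise is added. The function names and the linear beta schedule are hypothetical.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """Sample x_t ~ N(sqrt(1 - beta) * x_prev, beta), elementwise."""
    scale = math.sqrt(1.0 - beta)
    sigma = math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_chain(x0, betas, seed=0):
    """Apply the forward step once per entry in betas, starting from x0."""
    rng = random.Random(seed)
    x = list(x0)
    for beta in betas:
        x = forward_step(x, beta, rng)
    return x

# Hypothetical linear schedule over 1000 steps; toy data of 1000 copies of 3.0.
betas = [1e-4 + (0.02 - 1e-4) * t / 999 for t in range(1000)]
x0 = [3.0] * 1000
xT = forward_chain(x0, betas)

mean_T = sum(xT) / len(xT)
var_T = sum((v - mean_T) ** 2 for v in xT) / len(xT)
print(mean_T, var_T)
```

After enough steps the signal is scaled down almost to zero and the samples are close to standard Gaussian noise, which is the starting point for the reverse process.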
