DPO simplifies: removes the explicit reward model, trains directly on preferences

DPO simplifies: removes explicit reward model, trains directly on preferences
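
As a rough illustration of training directly on preferences, the sketch below computes the commonly cited DPO loss for a single preference pair. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair; all names are illustrative."""
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin; no separate reward model is involved.
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.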

Related concepts

Reinforcement learning from human feedback

RLHF optimizes a policy against a reward model trained on human preference pairs

Greedy vs dynamic programming: greedy makes locally optimal choices, DP considers all subproblems

Greedy: locally optimal choices; DP: considers all subproblems
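
A small coin-change example can make the contrast concrete. This is only a sketch: the function names and the coin set `[1, 3, 4]` are chosen for illustration, picked because the greedy strategy is suboptimal on them.

```python
def greedy_coin_count(coins, amount):
    """Always take as many of the largest remaining coin as possible."""
    count = 0
    for coin in sorted(coins, reverse=True):
        take = amount // coin
        count += take
        amount -= take * coin
    return count if amount == 0 else None

def dp_coin_count(coins, amount):
    """Solve every sub-amount from 1 up to amount and reuse the results."""
    inf = float("inf")
    best = [0] + [inf] * amount
    for sub in range(1, amount + 1):
        for coin in coins:
            if coin <= sub and best[sub - coin] + 1 < best[sub]:
                best[sub] = best[sub - coin] + 1
    return best[amount] if best[amount] != inf else None
```

For amount 6 with coins 1, 3, and 4, the greedy version picks 4 + 1 + 1 (three coins), while the DP version finds 3 + 3 (two coins).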

classifier-free guidance does: interpolates between conditional and unconditional generation

"Combines a model's conditional and unconditional predictions to steer generation toward the condition, without a separate classifier."

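One common way to write the classifier-free guidance combination is sketched below. The function name `cfg_combine` and the parameter `scale` are assumptions for this example, and guidance-scale conventions differ between papers and libraries, so treat the formula as one variant rather than the definitive one.

```python
def cfg_combine(uncond_pred, cond_pred, scale):
    """Blend unconditional and conditional predictions element-wise.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    return [u + scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]
```

In practice both predictions typically come from the same network, run once with the condition and once with it dropped.
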
gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation splits a large batch into smaller micro-batches and sums their gradients before a single weight update, so peak memory stays at the micro-batch level
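
The sketch below checks the idea on a toy one-parameter model `y = w * x` with a mean-squared-error loss. The helper names `grad_mse` and `accumulated_grad` are hypothetical; the point is only that size-weighted micro-batch gradients add up to the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_size):
    """Accumulate micro-batch gradients, each weighted by its share of the batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_size):
        micro_xs = xs[start:start + micro_size]
        micro_ys = ys[start:start + micro_size]
        # Only one micro-batch is processed at a time.
        total += grad_mse(w, micro_xs, micro_ys) * len(micro_xs) / n
    return total
```

A single optimizer step taken with the accumulated gradient then matches the step a full batch would have produced, up to floating-point rounding.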

the back-door criterion identifies: sufficient adjustment sets for causal estimation

The back-door criterion identifies sufficient adjustment sets for causal estimation

score matching does: learns the gradient of the log-density without normalizing

Score matching learns the gradient of the log-density (the score) without computing the normalizing constant
