DPO simplifies: removes explicit reward model, trains directly on preferences
Image: LERK, CC BY-SA 4.0, via Wikimedia Commons
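A minimal sketch of the per-pair DPO loss, to show what "trains directly on preferences" can look like. The function name, argument names, and the default `beta=0.1` are assumptions for illustration. The inputs are taken to be summed sequence log-probabilities from the policy and from a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper)."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

No reward model appears anywhere: the log-ratio against the reference plays the role of an implicit reward. When the policy equals the reference, the loss is log 2, and it falls as the policy shifts probability toward the chosen response.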
Reinforcement learning from human feedback
RLHF fits a reward model to human preference pairs, then optimizes the policy against that reward model with reinforcement learning
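A sketch of the reward-model step only, assuming the commonly used Bradley-Terry formulation, where the probability that one response is preferred depends on the difference of two scalar rewards. Both function names are hypothetical, and the later reinforcement learning step is not shown.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of one human preference pair."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

Minimizing this loss over many pairs pushes the reward of preferred responses above the reward of rejected ones.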
Greedy vs dynamic programming: greedy makes locally optimal choices, DP considers all subproblems
Greedy: locally optimal choices; DP: considers all subproblems
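A small illustration using coin change, chosen here only as a convenient example. With the hypothetical coin set `[1, 3, 4]` and amount 6, always taking the largest coin uses three coins, while the DP table finds the two-coin answer.

```python
def greedy_coin_count(coins, amount):
    """Take as many of the largest coin as possible, then move to the next."""
    count = 0
    for coin in sorted(coins, reverse=True):
        take, amount = divmod(amount, coin)
        count += take
    return count if amount == 0 else None

def dp_coin_count(coins, amount):
    """Fewest coins for every sub-amount from 0 up to the target amount."""
    inf = float("inf")
    best = [0] + [inf] * amount
    for sub in range(1, amount + 1):
        for coin in coins:
            if coin <= sub and best[sub - coin] + 1 < best[sub]:
                best[sub] = best[sub - coin] + 1
    return best[amount] if best[amount] != inf else None

print(greedy_coin_count([1, 3, 4], 6))  # 3 coins: 4 + 1 + 1
print(dp_coin_count([1, 3, 4], 6))      # 2 coins: 3 + 3
```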
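classifier-free guidance does: combines conditional and unconditional predictions, extrapolating toward the conditional one as the guidance scale grows
"Steers generation toward the condition by mixing one model's conditional and unconditional predictions: guided = uncond + w * (cond - uncond). No separate classifier is involved."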
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation keeps peak memory at the micro-batch level: it splits a large batch into smaller micro-batches and sums their gradients before a single weight update
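A framework-free sketch, assuming a one-parameter model `y_hat = w * x` with mean squared error and equal-sized micro-batches. All names are hypothetical. It checks that accumulated micro-batch gradients match the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    num_micro_batches = len(starts)
    grad = 0.0
    for start in starts:
        stop = start + micro_batch_size
        # Only one micro-batch needs to be held in memory at a time.
        grad += mse_grad(w, xs[start:stop], ys[start:stop]) / num_micro_batches
    return grad
```

The weight update happens once, after the loop, using the accumulated gradient. Dividing by the number of micro-batches is what keeps the result equal to the full-batch mean when the micro-batches have equal size.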
the back-door criterion identifies: sufficient adjustment sets for causal estimation
The back-door criterion identifies sufficient adjustment sets for causal estimation
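A toy example of adjusting for a back-door set, assuming a graph where a binary confounder Z causes both X and Y, and X causes Y, so that {Z} satisfies the criterion. The probability tables are invented for illustration. The adjustment formula used is P(y | do(x)) = sum over z of P(y | x, z) P(z).

```python
# Hypothetical probability tables for binary Z, X, Y.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.7}

def p_x_given_z(x, z):
    return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]

def naive_p_y1(x):
    """P(Y=1 | X=x): plain conditioning, which weights Z by P(z | x)."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    joint = sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return joint / p_x

def adjusted_p_y1(x):
    """P(Y=1 | do(X=x)) via back-door adjustment over Z."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)
```

With these tables the naive difference `naive_p_y1(1) - naive_p_y1(0)` is about 0.44, while the adjusted difference is about 0.2, which is the effect built into the tables within each stratum of Z.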
score matching does: learns the gradient of the log-density without normalizing
Score matching learns the gradient of the log-density without computing the normalizing constant
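A one-dimensional sketch using Hyvärinen's score matching objective, E[0.5 * s(x)^2 + s'(x)], with an assumed Gaussian-shaped score model s(x) = -(x - mu) / var. The function name and sample values are hypothetical. The normalizing constant of the density never appears.

```python
def score_matching_objective(samples, mu, var):
    """Empirical Hyvarinen objective for the score model s(x) = -(x - mu) / var."""
    total = 0.0
    for x in samples:
        score = -(x - mu) / var          # model's gradient of log-density at x
        score_derivative = -1.0 / var    # derivative of the score with respect to x
        total += 0.5 * score ** 2 + score_derivative
    return total / len(samples)
```

For this model family the objective is minimized when `mu` and `var` equal the sample mean and the (biased) sample variance, so fitting the score recovers the Gaussian fit without evaluating any density.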