RLHF optimizes a policy against a reward model trained on human preference pairs

Reinforcement learning from human feedback

Reinforcement learning from human feedback (RLHF) is a technique that aligns an intelligent agent with human preferences in two stages. First, a reward model is trained in a supervised manner on human rankings to predict the quality of responses. Once trained, the reward model serves as the reward function that guides reinforcement-learning optimization of the agent's policy.
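As a minimal sketch of the supervised step, a reward model is commonly fit with a pairwise loss that pushes the score of the human-preferred response above the score of the rejected one. The sketch below assumes a Bradley-Terry style preference model; the function names and the plain float scores are illustrative assumptions, not part of any particular library.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes P(chosen beats rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs) -> float:
    """Average the loss over (chosen_score, rejected_score) tuples."""
    losses = [pairwise_preference_loss(c, r) for c, r in score_pairs]
    return sum(losses) / len(losses)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and grows when it ranks them the wrong way round.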

Example

In natural language processing, RLHF can be used to train conversational agents by having human annotators rank responses, and then using those rankings to train a reward model that helps improve the agent's conversation skills.
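One way to turn an annotator's ranking into reward-model training data is to expand it into (preferred, rejected) pairs. This is a small sketch under that assumption; `ranking_to_pairs` and the response labels are hypothetical names used only for illustration.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand one ranking (best first) into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of every
    # pair is always ranked above the second.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])
```

A ranking of n responses yields n(n-1)/2 pairs, so a single ranking of three responses gives three training comparisons.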

Understanding RLHF is crucial for developing AI systems that better align with human values and preferences.

Related concepts

MoE models have more parameters but similar compute cost

MoE models distribute their parameters across multiple experts and activate only a few of them per token, so compute per token stays close to that of a much smaller dense model

DPO removes the explicit reward model and trains directly on preferences

Score matching learns the gradient of the log-density without computing the normalizing constant

Batch size affects generalization: larger batches tend to find sharper minima

Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which are often associated with worse generalization

The back-door criterion identifies sufficient adjustment sets for causal estimation

Reasoning model

Reasoning language models (RLMs) excel at logic, math, and programming tasks
