Reinforcement learning from human feedback

RLHF optimizes a reward model trained on human preference pairs

Reinforcement learning from human feedback

RLHF optimizes a reward model trained on human preference pairs

Reinforcement learning from human feedback (RLHF) is a technique that aligns an intelligent agent with human preferences by training a reward model. This reward model is initially trained in a supervised manner to predict the quality of responses based on human rankings. Once trained, it serves as a reward function to guide the optimization of an agent's policy.

Example

In natural language processing, RLHF can be used to train conversational agents by having human annotators rank responses, and then using those rankings to train a reward model that helps improve the agent's conversation skills.

Understanding RLHF is crucial for developing AI systems that better align with human values and preferences.

Related concepts

One email a day: 5 concepts + the 5 stories that matter →

Swipe through 100 ML concepts daily

Open TickerNews