Reinforcement Learning from Human Feedback

RLHF is a training technique that uses human preferences to fine-tune AI models, aligning their outputs with human values and expectations. It is key to making language models helpful, harmless, and honest.

Understanding Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns large language models with human preferences by using human evaluations to train a reward model, which then guides the language model's optimization through reinforcement learning. The process typically involves three stages: supervised fine-tuning on high-quality demonstrations, training a reward model on human comparisons of model outputs, and optimizing the language model against this reward model using algorithms like Proximal Policy Optimization (PPO). RLHF was instrumental in turning GPT-3-class base models into InstructGPT and later ChatGPT, dramatically improving helpfulness, honesty, and safety. The technique addresses the gap between the next-token prediction objective used in pre-training and the qualities humans actually value in AI responses. Alternatives and extensions include Direct Preference Optimization (DPO), which simplifies the pipeline by eliminating the separate reward model, and Constitutional AI, which uses AI feedback alongside human feedback to scale the alignment process.
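For intuition, here is a minimal PyTorch sketch of the reward-modeling stage. It uses random placeholder feature vectors in place of real transformer embeddings, and the RewardModel class, tensor shapes, and hyperparameters are illustrative assumptions rather than any specific production implementation.

```python
import torch
import torch.nn as nn

# Toy reward model: maps a (prompt, response) feature vector to a scalar score.
# In practice this head sits on top of a pretrained transformer; here a small
# MLP over fixed-size vectors stands in for it.
class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stage 2: train the reward model on human comparisons. Each pair holds
# features for a preferred ("chosen") and a less preferred ("rejected")
# response to the same prompt; the pairwise (Bradley-Terry style) loss
# pushes r(chosen) above r(rejected).
chosen = torch.randn(32, 16)    # placeholder features for preferred responses
rejected = torch.randn(32, 16)  # placeholder features for rejected responses

for _ in range(100):
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 3 (schematic): the language model is then optimized to maximize the
# learned reward minus a KL penalty that keeps it close to the supervised
# fine-tuned reference model:
#     objective = E[ r(x, y) ] - beta * KL(policy || reference)
# PPO is the algorithm typically used to optimize this objective.
```

The pairwise loss is the core idea: human annotators only rank outputs, and the model learns a scalar reward consistent with those rankings, which the reinforcement learning stage can then optimize against.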

Category

Generative AI

Related Generative AI Terms

Chain of Thought

Chain of thought is a prompting technique that encourages large language models to break down complex reasoning into intermediate steps. This approach significantly improves performance on math, logic, and multi-step reasoning tasks.

ChatGPT

ChatGPT is an AI chatbot developed by OpenAI that uses large language models to generate human-like conversational responses. It became one of the fastest-growing consumer applications in history after its launch in November 2022.

Claude

Claude is an AI assistant developed by Anthropic, designed to be helpful, harmless, and honest. It is built using Constitutional AI techniques and competes with models like GPT-4 and Gemini.

Diffusion Model

A diffusion model is a generative AI model that creates data by learning to reverse a gradual noise-adding process. Diffusion models power state-of-the-art image generation systems like Stable Diffusion and DALL-E.

Discriminator

A discriminator is the component of a GAN that learns to distinguish between real and generated data. It provides feedback to the generator, creating an adversarial training dynamic that improves output quality.

Few-Shot Prompting

Few-shot prompting provides a language model with a small number of input-output examples in the prompt to demonstrate the desired task format. This technique helps models understand task requirements without any fine-tuning.

Foundation Model

A foundation model is a large AI model trained on broad data that can be adapted to a wide range of downstream tasks. GPT-4, Claude, Gemini, and DALL-E are examples of foundation models that serve as bases for specialized applications.

GAN

A GAN (Generative Adversarial Network) is a generative model consisting of two competing neural networks — a generator and a discriminator. GANs produce realistic synthetic data by training these networks in an adversarial game.