Comparison

Justification for Using RLHF over Supervised Learning

Reinforcement Learning from Human Feedback (RLHF) is often preferred over standard supervised learning for model alignment because of a fundamental difficulty in data annotation: it is hard for humans to articulate complex values and goals, and harder still to demonstrate them by authoring perfectly aligned outputs. RLHF sidesteps this by shifting the human task from difficult demonstration to the simpler act of expressing preferences over model-generated candidates. This preference data is then used to train a reward model that captures human values. RLHF also offers an exploration advantage: by sampling, it can generate and evaluate outputs beyond the original annotated dataset, potentially discovering superior policies.
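
As a concrete, simplified illustration of the preference-learning step, the sketch below trains a toy reward model with a Bradley-Terry style pairwise loss, the standard objective for reward modeling in RLHF pipelines. The embedding inputs, dimensions, and random data are placeholders introduced here for illustration; in a real pipeline the reward model would share an LLM backbone and consume tokenized prompt-response pairs.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head; a stand-in for an LLM backbone with a value head."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) pooled representation of a prompt+response pair
        return self.head(x).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize the log-probability that the
    # human-preferred response receives a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical training step on a batch of annotated preference pairs.
dim = 16
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen = torch.randn(8, dim)    # embeddings of preferred responses (placeholder data)
rejected = torch.randn(8, dim)  # embeddings of rejected responses (placeholder data)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
print(f"preference loss: {loss.item():.4f}")
```

The log-sigmoid of the reward difference is the Bradley-Terry likelihood of the observed preference; once trained, the reward model scores sampled outputs during the subsequent RL stage, which is what enables exploration beyond the annotated dataset.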
