Comparison

Comparison of RLHF and DPO Training Pipelines

Standard Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) differ significantly in their training pipelines. In standard RLHF (e.g., with PPO), human preference data is first used to train a separate reward model; the reward model's scores are then used to optimize the target policy, alongside a learned value function. DPO collapses this multi-stage approach into a single step: it optimizes the policy directly on the human preference data, without the intermediate step of training a reward model.
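To make the contrast concrete, below is a minimal sketch of a DPO-style loss in PyTorch. The function name `dpo_loss`, the argument names, and the default `beta` are illustrative assumptions rather than code from the original text; each `*_logps` tensor is assumed to hold the summed log-probability of a chosen (preferred) or rejected response under the current policy or under a frozen reference model (typically the supervised fine-tuned checkpoint).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective, assuming precomputed response log-probs.

    Each tensor has shape (batch,), holding the summed log-probability of the
    chosen or rejected response under the trainable policy or the frozen
    reference model. beta scales the implicit KL-style penalty that keeps the
    policy close to the reference.
    """
    # Implicit "rewards": log-probability ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen response's implicit reward above the
    # rejected one's, with no separate reward model or value function.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In a full pipeline, the `policy_*_logps` would come from the model being trained and the `ref_*_logps` from a frozen copy of it, which is what lets DPO skip both the reward model and the value function used in PPO-based RLHF.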

