Learn Before
Comparison of RLHF and DPO Training Pipelines
Standard Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) differ significantly in their training pipelines. In standard RLHF (typically implemented with PPO), human preference data is first used to train a separate reward model, whose scores then provide the reinforcement-learning signal for training the target policy and, in PPO, its value function. DPO collapses this multi-stage pipeline into a single stage: it optimizes the policy directly on the human preference data with a classification-style objective, so no intermediate reward model needs to be trained.
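The pipeline difference can be made concrete with a minimal sketch (assuming PyTorch; the function names and tensor values below are illustrative, not part of the course material): standard RLHF first fits a reward model on preference pairs and only then runs PPO, whereas DPO turns the same preference pairs into a single loss on the policy itself.

```python
import torch
import torch.nn.functional as F

# --- RLHF, stage 1 (illustrative): fit a separate reward model ---
# A Bradley-Terry-style loss trains the reward model so the chosen response
# scores higher than the rejected one; the fitted rewards are later fed to an
# RL algorithm such as PPO to update the policy and its value function.
def reward_model_loss(chosen_rewards, rejected_rewards):
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# --- DPO (illustrative): a single stage, no reward model ---
# The same preference pairs are used directly; the "reward" is implicit in the
# log-probability ratio between the policy and a frozen reference model.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up numbers for a batch of two preference pairs.
print(reward_model_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, -0.1])))
print(dpo_loss(torch.tensor([-12.0, -9.8]), torch.tensor([-13.2, -11.0]),
               torch.tensor([-12.5, -10.1]), torch.tensor([-12.8, -10.7])))
```

Because the DPO loss consumes the preference labels directly, the reward-model training stage and its associated infrastructure drop out of the pipeline entirely.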
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Model as an Imperfect Environment Proxy
Direct Preference Optimization (DPO) Training Process
Comparison of RLHF and DPO Training Pipelines
Limitations of Human Feedback for LLM Alignment
An AI development team aims to align a large language model to be more helpful. They create a dataset where, for a given prompt, they collect two different responses from the model and have human annotators label which of the two responses is superior. What is the primary and most direct function of this specific type of dataset in a human preference alignment methodology?
A development team is refining a large language model to be more helpful and harmless. They are using a method that involves learning from human judgments about which of two responses is better. Arrange the following three core stages of this alignment process into the correct chronological order.
Insufficiency of Data Fitting for Complex Value Alignment
Comparison of AI Feedback and Human Feedback for LLM Alignment
Outcome-Based Reward Models
AI Chatbot Alignment Strategy
Learn After
Choosing an Alignment Strategy for a Resource-Constrained Project
For aligning a language model with human preferences, there are two main approaches: a complex, multi-stage pipeline and a simpler, direct pipeline. Match each characteristic below to the pipeline it describes.
An AI development team is choosing between two methods for aligning a language model with human preferences. Method A involves a multi-stage process: first, an explicit reward model is trained on preference data, and then this model is used to guide the language model's policy using reinforcement learning. Method B uses a simpler, single-stage process that directly optimizes the language model's policy on the preference data using a classification-style objective. What is the most significant implication of Method B's direct optimization approach compared to Method A's multi-stage approach?
Your team must choose an alignment approach for an...
Your team is implementing preference-based alignme...
Your team is reviewing two proposed alignment impl...
In a preference-based LLM alignment project, your ...
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints