Comparison

Comparison of DPO's Fixed Model Assumption with PPO

The core assumption in Direct Policy Optimization (DPO)—that the reward and reference models are fixed—is considered a strong assumption when contrasted with methods like Proximal Policy Optimization (PPO). This fundamental difference in the treatment of model components during optimization is what enables DPO to simplify the alignment problem, distinguishing its approach from the more complex dynamics of PPO.

0

1

Updated 2026-05-03

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences