1Cademy - Fixed Model Assumption in DPO Optimization

Learn Before

Direct Preference Optimization (DPO)

In the optimization problem for Direct Preference Optimization (DPO), a crucial simplifying assumption is made: both the reward model $r(\mathbf{x}, \mathbf{y})$ and the reference model $\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$ are treated as fixed functions of the input $\mathbf{x}$ and output $\mathbf{y}$, so neither changes as the parameters $\theta$ of the target policy are updated. Consequently, $\pi_{\theta}(\mathbf{y}|\mathbf{x})$ is the only term that depends on the parameters being optimized. This is a strong assumption compared with methods like Proximal Policy Optimization (PPO), but mathematically isolating the target policy in this way simplifies the problem and is critical for deriving the final DPO objective function.
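To make the role of the assumption concrete, the following is a minimal sketch on a toy problem, not the DPO training procedure itself. It assumes the standard KL-regularized objective that DPO starts from, $\mathbb{E}_{\mathbf{y} \sim \pi_{\theta}}[r(\mathbf{x}, \mathbf{y})] - \beta\,\mathrm{KL}(\pi_{\theta} \,\|\, \pi_{\theta_{\mathrm{ref}}})$, and its closed-form maximizer $\pi^{*}(\mathbf{y}|\mathbf{x}) \propto \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y})/\beta)$. The four-output toy space, the values in `REWARD` and `REF_PROBS`, and the choice `BETA = 0.5` are all hypothetical. Because the reward and reference tables are constants, the objective is a function of the policy alone, which is what allows a closed-form optimum to be written down.

```python
import math
import random

BETA = 0.5  # KL penalty strength (hypothetical value)

# Toy setup: a single input x with four candidate outputs y.
# Both tables are constants: they never change while theta is varied.
REWARD = [1.0, 0.2, -0.5, 0.8]     # r(x, y), assumed fixed
REF_PROBS = [0.4, 0.3, 0.2, 0.1]   # pi_ref(y | x), assumed fixed


def softmax(logits):
    """Map logits theta to the policy pi_theta(y | x)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(policy):
    """Expected reward minus BETA * KL(pi_theta || pi_ref)."""
    return sum(
        p * (r - BETA * (math.log(p) - math.log(q)))
        for p, r, q in zip(policy, REWARD, REF_PROBS)
        if p > 0.0
    )


def closed_form_optimum():
    """Return pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/BETA) / Z(x) and Z(x)."""
    weights = [q * math.exp(r / BETA) for q, r in zip(REF_PROBS, REWARD)]
    z = sum(weights)
    return [w / z for w in weights], z


optimal_policy, partition = closed_form_optimum()
best_value = objective(optimal_policy)

# Sample random policies; none should score above the closed-form optimum.
rng = random.Random(0)
random_values = [
    objective(softmax([rng.uniform(-3.0, 3.0) for _ in REWARD]))
    for _ in range(1000)
]

print("closed-form optimum:", [round(p, 4) for p in optimal_policy])
print("objective at optimum:", best_value)
print("best random policy:  ", max(random_values))
```

In this sketch the objective at the optimum equals $\beta \log Z(\mathbf{x})$, and every sampled policy scores at or below it. If `REWARD` or `REF_PROBS` were themselves functions of $\theta$, the closed form would no longer follow from optimizing over the policy alone.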

Updated 2026-05-03
