Fixed Model Assumption in DPO Optimization

In the optimization problem underlying Direct Preference Optimization (DPO), a crucial simplifying assumption is made: both the reward model $r(\mathbf{x}, \mathbf{y})$ and the reference model $\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$ are treated as fixed for a given input $\mathbf{x}$ and output $\mathbf{y}$. Consequently, only the probability term $\pi_{\theta}(\mathbf{y}|\mathbf{x})$ depends on the parameters of the target policy $\pi_{\theta}(\cdot)$ being optimized. Although this is a strong assumption compared to methods such as Proximal Policy Optimization (PPO), mathematically isolating the target policy simplifies the problem and is critical for deriving the final DPO objective function.
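
To make the assumption concrete, here is a minimal PyTorch sketch of a DPO-style loss in which the reference model's log-probabilities are detached from the computation graph, so gradients flow only through $\pi_{\theta}(\mathbf{y}|\mathbf{x})$. The function name, tensor shapes, and the value of $\beta$ are illustrative assumptions, not code from the source.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Fixed-model assumption in practice: detach the reference terms
    # so no gradient flows into pi_ref; only pi_theta is optimized.
    ref_chosen_logps = ref_chosen_logps.detach()
    ref_rejected_logps = ref_rejected_logps.detach()

    # Beta-scaled log-ratios log(pi_theta / pi_ref) for the preferred
    # and dispreferred outputs (any shared partition term cancels in
    # the difference below).
    chosen_term = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_term = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference likelihood: -log sigmoid(margin).
    return -F.logsigmoid(chosen_term - rejected_term).mean()

# Stand-in per-sequence log-probs for a batch of 4 preference pairs.
pol_w = torch.randn(4, requires_grad=True)     # log pi_theta(y_w | x)
pol_l = torch.randn(4, requires_grad=True)     # log pi_theta(y_l | x)
ref_w, ref_l = torch.randn(4), torch.randn(4)  # fixed reference terms

loss = dpo_loss(pol_w, pol_l, ref_w, ref_l)
loss.backward()  # gradients reach pol_w and pol_l only
```

In a real training loop the reference log-probabilities would be computed once under `torch.no_grad()` from a frozen copy of the initial policy; the `detach()` calls here simply make the fixed-model assumption explicit.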

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences