Formula

Conceptual Objective Function Assumed in DPO

Before explicitly deriving the Direct Preference Optimization (DPO) objective, the method conceptually assumes a foundational policy training objective in which the quality of an output $\mathbf{y}$ given an input $\mathbf{x}$ is evaluated by a theoretical reward model $r(\mathbf{x}, \mathbf{y})$. The goal is to find optimal parameters $\tilde{\theta}$ by minimizing the expectation of a loss term (the negative reward, $-r(\mathbf{x}, \mathbf{y})$) plus a penalty term that regularizes the target policy $\pi_{\theta}$ against a reference policy $\pi_{\theta_{\text{ref}}}$. The assumed training objective is given by:

$$\tilde{\theta} = \arg\min_{\theta} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot \mid \mathbf{x})} \Big[ \underbrace{-r(\mathbf{x}, \mathbf{y})}_{\text{loss}} + \beta \underbrace{\big(\log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y} \mid \mathbf{x})\big)}_{\text{penalty}} \Big]$$
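Note that the expectation of the penalty term over $\mathbf{y} \sim \pi_{\theta}(\cdot \mid \mathbf{x})$ is exactly the KL divergence $\mathrm{KL}\big(\pi_{\theta}(\cdot \mid \mathbf{x}) \,\|\, \pi_{\theta_{\text{ref}}}(\cdot \mid \mathbf{x})\big)$, so $\beta$ controls how far the trained policy may drift from the reference. To make the bracketed quantity concrete, below is a minimal PyTorch-style sketch; the function name, its signature, and the example numbers are illustrative assumptions rather than part of the source. It only evaluates the per-sample term $-r(\mathbf{x}, \mathbf{y}) + \beta(\log \pi_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \pi_{\theta_{\text{ref}}}(\mathbf{y} \mid \mathbf{x}))$ for a given response; optimizing the full expectation over $\mathbf{y} \sim \pi_{\theta}$ would still require sampling and policy-gradient machinery, which is the step the DPO derivation is designed to avoid.

```python
import torch

def regularized_reward_loss(logp_policy: torch.Tensor,
                            logp_ref: torch.Tensor,
                            reward: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Per-sample value of the bracketed objective (a sketch, not the DPO loss):
        -r(x, y) + beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    logp_policy: log pi_theta(y|x), summed over the tokens of y (tracks gradients)
    logp_ref:    log pi_ref(y|x), summed over the tokens of y (treated as a constant)
    reward:      scalar reward r(x, y) from the assumed reward model
    beta:        regularization strength (illustrative default)
    """
    penalty = logp_policy - logp_ref.detach()  # per-sample log-ratio penalty
    return -reward + beta * penalty

# Illustrative usage with made-up numbers.
loss = regularized_reward_loss(
    logp_policy=torch.tensor(-42.0, requires_grad=True),
    logp_ref=torch.tensor(-40.0),
    reward=torch.tensor(1.3),
)
loss.backward()  # gradient w.r.t. the policy log-probability
```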
