Formula

Rearrangement of the Assumed DPO Objective

To isolate the variable $\theta$, the assumed Direct Preference Optimization (DPO) objective is rearranged so that the target policy term $\pi_{\theta}$ is separated from the fixed reference terms. The objective becomes the expected difference between the log-probability of the target policy and a fixed function of $\mathbf{x}$ and $\mathbf{y}$ that does not depend on $\theta$:

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \Big( \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\Big( \frac{1}{\beta}\, r(\mathbf{x},\mathbf{y}) \Big) \Big) \Big]
$$
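As a sketch of where this form comes from, assuming the "assumed objective" referenced above is the standard KL-regularized reward maximization of the DPO setup (that starting line is an assumption here, not restated from this section):

$$
\tilde{\theta} = \argmax_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ r(\mathbf{x},\mathbf{y}) - \beta \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} \Big]
$$

Dividing the bracketed term by $-\beta$ (with $\beta > 0$, this flips the $\argmax$ into an $\argmin$) and expanding the log-ratio gives

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) - \frac{1}{\beta}\, r(\mathbf{x},\mathbf{y}) \Big],
$$

and folding the last two terms into a single logarithm yields the displayed objective.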

This formulation expresses the objective as a difference of log-probabilities, paving the way for it to be interpreted as a divergence between distributions.
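As a sketch of that interpretation: define the fixed target distribution $\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\big(\frac{1}{\beta}\, r(\mathbf{x},\mathbf{y})\big)$, where $Z(\mathbf{x})$ is the normalizer that makes $\pi^{*}$ sum to one ($\pi^{*}$ and $Z$ are introduced here for illustration and do not appear in the text above). Substituting into the bracketed term gives

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi^{*}(\mathbf{y}|\mathbf{x})} - \log Z(\mathbf{x}) \Big] = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_{\theta}(\cdot|\mathbf{x}) \,\|\, \pi^{*}(\cdot|\mathbf{x}) \big) \Big],
$$

since $\log Z(\mathbf{x})$ does not depend on $\theta$ and drops out of the $\argmin$.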
