Formula

Rearrangement of the Assumed DPO Objective

To isolate the variable $\theta$, the assumed Direct Preference Optimization (DPO) objective is rearranged so that the target policy term $\pi_{\theta}$ is separated from the fixed reference terms. The objective becomes the expected difference between the log-probability of the target policy and a fixed function of $\mathbf{x}$ and $\mathbf{y}$ that does not depend on $\theta$:

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \Big( \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\Big( \frac{1}{\beta}\, r(\mathbf{x},\mathbf{y}) \Big) \Big) \Big]
$$
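As a sketch of where this form comes from, assuming the "assumed objective" referenced above is the standard KL-regularized reward maximization of the DPO setup (that starting line is an assumption here, not restated from this section):

$$
\tilde{\theta} = \argmax_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ r(\mathbf{x},\mathbf{y}) - \beta \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} \Big]
$$

Dividing the bracketed term by $-\beta$ (with $\beta > 0$, this flips the $\argmax$ into an $\argmin$) and expanding the log-ratio gives

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) - \frac{1}{\beta}\, r(\mathbf{x},\mathbf{y}) \Big],
$$

and folding the last two terms into a single logarithm yields the displayed objective.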

This formulation expresses the objective as a difference of log-probabilities, paving the way for it to be interpreted as a divergence between distributions.
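As a sketch of that interpretation: define the fixed target distribution $\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\big(\frac{1}{\beta}\, r(\mathbf{x},\mathbf{y})\big)$, where $Z(\mathbf{x})$ is the normalizer that makes $\pi^{*}$ sum to one ($\pi^{*}$ and $Z$ are introduced here for illustration and do not appear in the text above). Substituting into the bracketed term gives

$$
\tilde{\theta} = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \Big[ \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi^{*}(\mathbf{y}|\mathbf{x})} - \log Z(\mathbf{x}) \Big] = \argmin_{\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_{\theta}(\cdot|\mathbf{x}) \,\|\, \pi^{*}(\cdot|\mathbf{x}) \big) \Big],
$$

since $\log Z(\mathbf{x})$ does not depend on $\theta$ and drops out of the $\argmin$.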
