Formula

Derivation of DPO Preference Probability from Policy Ratios

The probability that a preferred response $\mathbf{y}_a$ is ranked higher than a dispreferred response $\mathbf{y}_b$ given an input $\mathbf{x}$ can be derived using policy ratios. Starting from the Bradley-Terry model, which depends on a latent reward function $r$, we substitute the reward expressed in terms of the target policy $\pi_{\theta}$, the reference policy $\pi_{\theta_{\mathrm{ref}}}$, and the normalization factor $Z(\mathbf{x})$.
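To make the substituted reward explicit (this is the standard intermediate step of the DPO derivation), recall that the optimal policy of the KL-constrained reward-maximization problem satisfies $\pi_{\theta}(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\big(r(\mathbf{x},\mathbf{y})/\beta\big)$, which inverts to

\begin{align*} r(\mathbf{x},\mathbf{y}) = \beta \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} + \beta \log Z(\mathbf{x}) \end{align*}

When the two rewards are subtracted below, the intractable $\log Z(\mathbf{x})$ terms neatly cancel, transforming the difference in rewards into a difference of log-policy ratios: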

\begin{align*}
\mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})
&= \mathrm{Sigmoid}\big(r(\mathbf{x},\mathbf{y}_a) - r(\mathbf{x},\mathbf{y}_b)\big) \\
&= \mathrm{Sigmoid}\bigg(\beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} + \log Z(\mathbf{x}) \Big) - \beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} + \log Z(\mathbf{x}) \Big) \bigg) \\
&= \mathrm{Sigmoid}\bigg( \beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} \bigg)
\end{align*}

This elegant formula makes it possible to compute preference probabilities directly from the two policies, completely bypassing the need for a separately trained reward model.
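As an illustrative sketch (not from the source), the final formula can be evaluated numerically from sequence log-probabilities under the two policies; the function name and toy numbers below are hypothetical:

```python
import math

def dpo_preference_prob(
    logp_theta_a: float,   # log pi_theta(y_a | x): target-policy log-prob of preferred response
    logp_ref_a: float,     # log pi_ref(y_a | x): reference-policy log-prob of preferred response
    logp_theta_b: float,   # log pi_theta(y_b | x): target-policy log-prob of dispreferred response
    logp_ref_b: float,     # log pi_ref(y_b | x): reference-policy log-prob of dispreferred response
    beta: float = 0.1,     # KL-regularization strength from the DPO objective
) -> float:
    """Pr_theta(y_a > y_b | x) = Sigmoid(beta * (log_ratio_a - log_ratio_b))."""
    log_ratio_a = logp_theta_a - logp_ref_a   # log [pi_theta(y_a|x) / pi_ref(y_a|x)]
    log_ratio_b = logp_theta_b - logp_ref_b   # log [pi_theta(y_b|x) / pi_ref(y_b|x)]
    logit = beta * (log_ratio_a - log_ratio_b)
    return 1.0 / (1.0 + math.exp(-logit))     # Sigmoid

# Toy example: the target policy has shifted probability mass toward y_a
# relative to the reference, so the preference probability exceeds 0.5.
print(dpo_preference_prob(
    logp_theta_a=-10.0, logp_ref_a=-12.0,     # log_ratio_a = +2.0
    logp_theta_b=-11.0, logp_ref_b=-9.0,      # log_ratio_b = -2.0
    beta=0.5,
))  # Sigmoid(0.5 * 4.0) = Sigmoid(2.0) ≈ 0.881
```

In practice, each sequence log-probability is the sum of the token-level log-probabilities of the response under the corresponding model.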
