Derivation of DPO Preference Probability from Policy Ratios
The probability that a preferred response $\mathbf{y}_a$ is ranked higher than a dispreferred response $\mathbf{y}_b$ given an input $\mathbf{x}$ can be derived using policy ratios. Starting from the Bradley-Terry model, which depends on a latent reward function $r(\mathbf{x},\mathbf{y})$, we substitute the reward expressed in terms of the target policy $\pi_{\theta}$, the reference policy $\pi_{\theta_{\mathrm{ref}}}$, and the normalization factor $Z(\mathbf{x})$, i.e., $r(\mathbf{x},\mathbf{y}) = \beta \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} + \beta \log Z(\mathbf{x})$. During this derivation, the intractable $Z(\mathbf{x})$ term cancels, transforming the difference in rewards into a difference of log-policy ratios:
\begin{align*} \mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b | \mathbf{x}) &= \mathrm{Sigmoid}(r(\mathbf{x},\mathbf{y}_a)-r(\mathbf{x},\mathbf{y}_b)) \\ &= \mathrm{Sigmoid}\bigg(\beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} + \log Z(\mathbf{x}) \Big) - \beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} + \log Z(\mathbf{x}) \Big) \bigg) \\ &= \mathrm{Sigmoid}\bigg( \beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} \bigg) \end{align*}
This formula makes it possible to compute preference probabilities directly from the two policies, bypassing the need to train a separate reward model.
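As a concrete illustration, here is a minimal PyTorch sketch of the final expression. It assumes the per-response log-probabilities $\log \pi_{\theta}(\mathbf{y}|\mathbf{x})$ and $\log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$ have already been summed over response tokens; the function name, argument names, and $\beta$ value are illustrative, not part of the original derivation.

```python
import torch

def dpo_preference_prob(logp_theta_a, logp_ref_a, logp_theta_b, logp_ref_b, beta=0.1):
    """Pr(y_a > y_b | x) under the DPO reparameterization:
    sigmoid(beta * (log pi_theta(y_a|x)/pi_ref(y_a|x)
                    - log pi_theta(y_b|x)/pi_ref(y_b|x)))
    Inputs are sequence-level log-probabilities (summed over tokens).
    """
    # Log policy ratios; log Z(x) has already cancelled, so it never appears.
    ratio_a = logp_theta_a - logp_ref_a
    ratio_b = logp_theta_b - logp_ref_b
    return torch.sigmoid(beta * (ratio_a - ratio_b))

# Example with made-up sequence log-probabilities:
prob = dpo_preference_prob(
    logp_theta_a=torch.tensor(-12.0), logp_ref_a=torch.tensor(-14.0),  # y_a: log-ratio = +2
    logp_theta_b=torch.tensor(-15.0), logp_ref_b=torch.tensor(-13.0),  # y_b: log-ratio = -2
    beta=0.5,
)
print(prob)  # sigmoid(0.5 * (2 - (-2))) = sigmoid(2.0) ≈ 0.88
```

Note that only the two log-ratios enter the computation, which is exactly why the intractable $Z(\mathbf{x})$ never needs to be estimated in practice.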
