Derivation of DPO Preference Probability from Policy Ratios
The probability that a preferred response $\mathbf{y}_a$ is ranked higher than a dispreferred response $\mathbf{y}_b$ given an input $\mathbf{x}$ can be derived using policy ratios. Starting from the Bradley-Terry model, which depends on a latent reward function $r(\mathbf{x},\mathbf{y})$, we substitute the reward expressed in terms of the target policy $\pi_{\theta}$, the reference policy $\pi_{\theta_{\mathrm{ref}}}$, and the normalization factor $Z(\mathbf{x})$, i.e., $r(\mathbf{x},\mathbf{y}) = \beta \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})} + \beta \log Z(\mathbf{x})$. During this derivation, the intractable $Z(\mathbf{x})$ term cancels, transforming the difference in rewards into a difference of log-policy ratios:
\begin{align*} \mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b | \mathbf{x}) &= \mathrm{Sigmoid}(r(\mathbf{x},\mathbf{y}_a)-r(\mathbf{x},\mathbf{y}_b)) \\ &= \mathrm{Sigmoid}\bigg(\beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} + \log Z(\mathbf{x}) \Big) - \beta \Big(\log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} + \log Z(\mathbf{x}) \Big) \bigg) \\ &= \mathrm{Sigmoid}\bigg( \beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_a|\mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_b|\mathbf{x})} \bigg) \end{align*}
This formula makes it possible to compute preference probabilities directly from the two policies, bypassing the need to train a separate reward model.
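As a concrete illustration, here is a minimal PyTorch sketch of the final expression. It assumes the per-response log-probabilities $\log \pi_{\theta}(\mathbf{y}|\mathbf{x})$ and $\log \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x})$ have already been summed over response tokens; the function name, argument names, and $\beta$ value are illustrative, not part of the original derivation.

```python
import torch

def dpo_preference_prob(logp_theta_a, logp_ref_a, logp_theta_b, logp_ref_b, beta=0.1):
    """Pr(y_a > y_b | x) under the DPO reparameterization:
    sigmoid(beta * (log pi_theta(y_a|x)/pi_ref(y_a|x)
                    - log pi_theta(y_b|x)/pi_ref(y_b|x)))
    Inputs are sequence-level log-probabilities (summed over tokens).
    """
    # Log policy ratios; log Z(x) has already cancelled, so it never appears.
    ratio_a = logp_theta_a - logp_ref_a
    ratio_b = logp_theta_b - logp_ref_b
    return torch.sigmoid(beta * (ratio_a - ratio_b))

# Example with made-up sequence log-probabilities:
prob = dpo_preference_prob(
    logp_theta_a=torch.tensor(-12.0), logp_ref_a=torch.tensor(-14.0),  # y_a: log-ratio = +2
    logp_theta_b=torch.tensor(-15.0), logp_ref_b=torch.tensor(-13.0),  # y_b: log-ratio = -2
    beta=0.5,
)
print(prob)  # sigmoid(0.5 * (2 - (-2))) = sigmoid(2.0) ≈ 0.88
```

Note that only the two log-ratios enter the computation, which is exactly why the intractable $Z(\mathbf{x})$ never needs to be estimated in practice.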
