1Cademy - Analysis of Normalization Factor Cancellation

Learn Before

Derivation of DPO Preference Probability from Policy Ratios

Short Answer

Analysis of Normalization Factor Cancellation

The process of re-expressing the preference probability for a chosen response ($\mathbf{y}_a$) over a rejected response ($\mathbf{y}_b$) begins by substituting the implicit reward function into the preference model. The reward function for a given response $\mathbf{y}$ is defined as $r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$. The preference probability is modeled as $\text{Pr}(\mathbf{y}_a \succ \mathbf{y}_b|\mathbf{x}) = \text{Sigmoid}(r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b))$. Explain precisely why the normalization factor term, $\log Z(\mathbf{x})$, does not appear in the final simplified expression for the preference probability.
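
As a starting point for an answer, the substitution can be sketched as follows. It uses only the two definitions stated in the prompt, plus the observation that $Z(\mathbf{x})$ depends on the input $\mathbf{x}$ alone and not on the response:

$$
\begin{aligned}
r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)
&= \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_a|\mathbf{x})} + \log Z(\mathbf{x}) \right)
 - \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_b|\mathbf{x})} + \log Z(\mathbf{x}) \right) \\
&= \beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_a|\mathbf{x})}
 - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_b|\mathbf{x})}
 + \underbrace{\beta \log Z(\mathbf{x}) - \beta \log Z(\mathbf{x})}_{=\,0}
\end{aligned}
$$

Both responses are conditioned on the same $\mathbf{x}$, so each reward carries the identical additive term $\beta \log Z(\mathbf{x})$, and the two copies cancel in the difference. Because the preference model depends on the rewards only through that difference, the result is $\text{Pr}(\mathbf{y}_a \succ \mathbf{y}_b|\mathbf{x}) = \text{Sigmoid}\left( \beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_a|\mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_b|\mathbf{x})} \right)$, with no $Z(\mathbf{x})$ term.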

Updated 2025-10-04
