1Cademy

Learn Before

Derivation of DPO Preference Probability from Policy Ratios

Multiple Choice

A key step in an alignment algorithm involves re-expressing the preference probability of a chosen response ($\mathbf{y}_a$) over a rejected response ($\mathbf{y}_b$) for a given input ($\mathbf{x}$). The derivation is as follows:

$$
\begin{align*}
\text{Pr}(\mathbf{y}_a \succ \mathbf{y}_b|\mathbf{x}) &= \text{Sigmoid}\left(\beta\left(\log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_a|\mathbf{x})} + \log Z(\mathbf{x})\right) - \beta\left(\log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_b|\mathbf{x})} + \log Z(\mathbf{x})\right)\right) \\
&= \text{Sigmoid}\left(\beta \log \frac{\pi_{\theta}(\mathbf{y}_a|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_a|\mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_b|\mathbf{x})}\right)
\end{align*}
$$

Based on this mathematical simplification, what is the most significant practical consequence for the model training process?
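
The step between the two lines of the derivation can be checked numerically. The following is a minimal sketch, not part of the original question: the function names `pref_prob_with_z` and `pref_prob` are hypothetical, and the log-probabilities, $\beta$, and $\log Z(\mathbf{x})$ values are made-up illustrative numbers. It evaluates the first line of the derivation for several arbitrary values of $\log Z(\mathbf{x})$ and compares each result with the second line.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, written as Sigmoid in the derivation."""
    return 1.0 / (1.0 + math.exp(-t))


def pref_prob_with_z(beta, logp_a, logp_ref_a, logp_b, logp_ref_b, log_z):
    """First line of the derivation: log Z(x) appears in both terms."""
    term_a = beta * ((logp_a - logp_ref_a) + log_z)
    term_b = beta * ((logp_b - logp_ref_b) + log_z)
    return sigmoid(term_a - term_b)


def pref_prob(beta, logp_a, logp_ref_a, logp_b, logp_ref_b):
    """Second line of the derivation: only the policy log-ratios remain."""
    return sigmoid(beta * (logp_a - logp_ref_a) - beta * (logp_b - logp_ref_b))


# Illustrative (made-up) sequence log-probabilities under the policy
# and the reference policy for the chosen (a) and rejected (b) responses.
beta = 0.1
logp_a, logp_ref_a = -12.0, -14.0
logp_b, logp_ref_b = -15.0, -13.0

simplified = pref_prob(beta, logp_a, logp_ref_a, logp_b, logp_ref_b)

# Try several arbitrary values for log Z(x); each should give the same
# probability as the simplified form, up to floating-point rounding.
for log_z in (-50.0, 0.0, 3.7, 100.0):
    full = pref_prob_with_z(beta, logp_a, logp_ref_a, logp_b, logp_ref_b, log_z)
    print(log_z, math.isclose(full, simplified, rel_tol=1e-9))
```

In this sketch, the second form takes no `log_z` argument at all: it depends only on the log-probabilities of the two responses under the policy and the reference policy.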

Updated 2025-10-02
