Formula

Modeling Pairwise Preference Probability with a Reward Function

The probability that a response $\mathbf{y}_a$ is preferred over another response $\mathbf{y}_b$ given an input $\mathbf{x}$ is modeled using a learned reward function $r(\mathbf{x}, \mathbf{y})$. This is achieved by applying the sigmoid function to the difference between the reward scores of the two responses, as specified by the Bradley-Terry model. The formula is:

$$\text{Pr}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) = \text{Sigmoid}\big(r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)\big)$$

This is a foundational component for training reward models in RLHF.
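
As a numerical illustration, here is a minimal Python sketch of this probability. The function name `preference_probability` and the example reward values are hypothetical, chosen only to show how the sigmoid of the reward difference behaves; they are not taken from the source.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that the response scored reward_a is
    preferred over the response scored reward_b for the same input x."""
    # Pr(y_a > y_b | x) = sigmoid(r(x, y_a) - r(x, y_b))
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Hypothetical reward scores r(x, y_a) = 1.2 and r(x, y_b) = 0.4:
# sigmoid(0.8) ~= 0.69, so y_a is preferred about 69% of the time.
print(preference_probability(1.2, 0.4))  # ~0.690

# Reward-model training commonly minimizes the negative log of this
# probability over human-labeled preference pairs.
loss = -math.log(preference_probability(1.2, 0.4))
print(loss)  # ~0.371
```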

