Formula

Derivation of the Bradley-Terry Preference Formula

The Bradley-Terry model expresses the probability that one item, $\mathbf{y}_a$, is preferred over another, $\mathbf{y}_b$, given a context $\mathbf{x}$. It defines this probability as the ratio of the exponentiated reward score of the preferred item to the sum of the exponentiated scores of both items. Dividing the numerator and the denominator by $e^{r(\mathbf{x}, \mathbf{y}_b)}$ reduces this ratio to the sigmoid function of the difference between the two reward scores:

$$
\begin{aligned}
\text{Pr}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})
&= \frac{e^{r(\mathbf{x}, \mathbf{y}_a)}}{e^{r(\mathbf{x}, \mathbf{y}_a)} + e^{r(\mathbf{x}, \mathbf{y}_b)}} \\
&= \frac{e^{r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)}}{e^{r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)} + 1} \\
&= \text{Sigmoid}(r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b))
\end{aligned}
$$

Here $\text{Sigmoid}(z) = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}}$. The derivation shows that a model based on exponentiated scores is equivalent to modeling the preference probability with the sigmoid of the score difference, so the probability depends only on how far apart the two rewards are, not on their absolute values.
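The equivalence can also be checked numerically. The following is a minimal Python sketch, not a reference implementation: the function names `bt_ratio`, `sigmoid`, and `bt_sigmoid` are hypothetical, and the reward scores are assumed to be plain floats standing in for $r(\mathbf{x}, \mathbf{y}_a)$ and $r(\mathbf{x}, \mathbf{y}_b)$.

```python
import math


def bt_ratio(r_a: float, r_b: float) -> float:
    """Preference probability as a ratio of exponentiated rewards."""
    # Subtracting the larger reward from both scores leaves the ratio
    # unchanged and avoids overflow in exp() for large rewards.
    m = max(r_a, r_b)
    e_a = math.exp(r_a - m)
    e_b = math.exp(r_b - m)
    return e_a / (e_a + e_b)


def sigmoid(z: float) -> float:
    """Sigmoid(z) = 1 / (1 + exp(-z)), evaluated in a stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_sigmoid(r_a: float, r_b: float) -> float:
    """Preference probability as the sigmoid of the reward difference."""
    return sigmoid(r_a - r_b)


if __name__ == "__main__":
    # Illustrative reward pairs (assumed values, not taken from any dataset).
    for r_a, r_b in [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0)]:
        assert math.isclose(bt_ratio(r_a, r_b), bt_sigmoid(r_a, r_b))
```

Both forms agree up to floating-point error. In practice the sigmoid form tends to be the more convenient one, because it needs only the reward difference and a single exponential.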

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
