
Empirical Reward Model Loss Formula using Bradley-Terry Model

The reward model is trained by minimizing an empirical loss function derived from the Bradley-Terry model for pairwise comparisons. The objective is to adjust the model's parameters $\phi$ to minimize the negative log-likelihood of the observed human preferences in the dataset $\mathcal{D}_r$. For each comparison, the sigmoid function is applied to the difference between the reward scores of the preferred response $\mathbf{y}_a$ and the rejected response $\mathbf{y}_b$, and the negative logarithm of this probability is averaged over the entire dataset. The formula is:

$$\min_{\phi} \; - \frac{1}{|\mathcal{D}_r|} \sum_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \in \mathcal{D}_r} \log \sigma\bigl(r_\phi(\mathbf{x},\mathbf{y}_a) - r_\phi(\mathbf{x},\mathbf{y}_b)\bigr)$$
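As a minimal sketch, the loss above can be computed directly from scalar reward scores. The scores below are hypothetical stand-ins for $r_\phi(\mathbf{x},\mathbf{y}_a)$ and $r_\phi(\mathbf{x},\mathbf{y}_b)$; in practice they would come from the reward model, and the loss would be minimized over $\phi$ by gradient descent.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(score_pairs):
    """Empirical Bradley-Terry loss: average negative log-likelihood
    over (r_a, r_b) pairs, where r_a is the reward of the preferred
    response and r_b the reward of the rejected one."""
    return -sum(math.log(sigmoid(r_a - r_b)) for r_a, r_b in score_pairs) / len(score_pairs)

# Hypothetical reward scores for two comparisons: in the first the
# preferred response scores well above the rejected one; in the
# second the model is indifferent (equal scores give a loss of log 2).
pairs = [(2.0, 0.5), (1.0, 1.0)]
loss = bt_loss(pairs)
```

Note that the loss depends only on the *difference* in scores, a consequence of the Bradley-Terry model: shifting every reward by a constant leaves the preference probabilities, and hence the loss, unchanged.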


Updated 2026-05-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
