Formula

Empirical Pair-wise Ranking Loss for RLHF Reward Model

The reward model in RLHF is trained by minimizing an empirical pair-wise ranking loss, which is calculated as an average over the human preference dataset. This loss function encourages the model to assign a higher score to a preferred response y_a than to a less preferred one y_b for the same input prompt. The formula, which is based on the Bradley-Terry model, is:

\mathcal{L}(\phi) = - \frac{1}{|\mathcal{D}_r|} \sum_{(x,y_a,y_b) \in \mathcal{D}_r} \log \sigma\left(r_\phi(x,y_a) - r_\phi(x,y_b)\right)

Here, \mathcal{D}_r is the preference dataset, r_\phi is the reward model with parameters \phi, and \sigma is the sigmoid function.
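Below is a minimal PyTorch sketch of this loss, not taken from the source: the function name pairwise_ranking_loss and the toy score tensors are illustrative assumptions, and it presumes the reward model has already produced scalar scores r_\phi(x, y_a) and r_\phi(x, y_b) for each pair in a batch.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    """Empirical pair-wise ranking loss over a batch of preference pairs.

    r_a: scores r_phi(x, y_a) for the preferred responses, shape (batch,)
    r_b: scores r_phi(x, y_b) for the less preferred responses, shape (batch,)
    Returns the mean of -log sigmoid(r_a - r_b), matching the formula above.
    """
    # -log(sigmoid(z)) equals softplus(-z), which is numerically stabler
    # than applying log to a sigmoid directly.
    return F.softplus(-(r_a - r_b)).mean()

# Toy usage with hypothetical reward scores for three preference pairs.
r_a = torch.tensor([1.2, 0.4, 2.0])  # scores for preferred responses y_a
r_b = torch.tensor([0.3, 0.9, 1.1])  # scores for rejected responses y_b
print(pairwise_ranking_loss(r_a, r_b))  # lower when r_a consistently exceeds r_b
```

Minimizing this loss pushes the score margin r_\phi(x, y_a) - r_\phi(x, y_b) to be large and positive, which under the Bradley-Terry model corresponds to a high predicted probability that y_a is preferred over y_b.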
