Formula

Reward Model Loss as Negative Log-Likelihood

To train the reward model in RLHF, the objective is to maximize the preference probability defined by the Bradley-Terry model. This is achieved by minimizing the negative log-likelihood over the human preference dataset $\mathcal{D}_r$. The loss function is given by

$$\mathcal{L}_r(\phi) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \log \mathrm{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) \big]$$

where $\phi$ denotes the trainable parameters of the reward model, and each sample $(\mathbf{x}, \mathbf{y}_a, \mathbf{y}_b)$ records a human preference for response $\mathbf{y}_a$ over $\mathbf{y}_b$ given input $\mathbf{x}$.
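A minimal PyTorch sketch of this loss, assuming the standard Bradley-Terry parameterization $\mathrm{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) = \sigma\big(r_{\phi}(\mathbf{x},\mathbf{y}_a) - r_{\phi}(\mathbf{x},\mathbf{y}_b)\big)$ with $r_{\phi}$ a scalar reward head; the function and tensor names below are illustrative, not from the source:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference probability.

    Under Bradley-Terry, Pr_phi(y_a > y_b | x) = sigmoid(r_a - r_b), where
    r_a = r_phi(x, y_a) and r_b = r_phi(x, y_b) are scalar rewards. Hence
    -log Pr = -logsigmoid(r_a - r_b); the expectation over D_r is
    approximated by the batch mean.
    """
    return -F.logsigmoid(r_a - r_b).mean()

# Illustrative usage with made-up reward scores for a batch of 3 preferences:
r_preferred = torch.tensor([1.2, 0.3, 2.1])  # r_phi(x, y_a), preferred responses
r_rejected = torch.tensor([0.4, 0.9, 1.5])   # r_phi(x, y_b), rejected responses
print(reward_model_loss(r_preferred, r_rejected))  # scalar loss, ~0.62 here
```

Note that the middle pair above has the rejected response scored higher than the preferred one; minimizing this loss pushes the reward gap $r_{\phi}(\mathbf{x},\mathbf{y}_a) - r_{\phi}(\mathbf{x},\mathbf{y}_b)$ positive on such samples.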

