Reward Model Loss as Negative Log-Likelihood
To train the reward model in RLHF, the objective is to maximize the preference probability defined by the Bradley-Terry model. This is mathematically achieved by minimizing a loss function based on the negative log-likelihood over the human preference dataset $\mathcal{D}$. The loss function is given by: $L(\phi) = -\mathbb{E}_{(x, y_a, y_b)\sim\mathcal{D}}\left[\log \sigma\bigl(r_\phi(x, y_a) - r_\phi(x, y_b)\bigr)\right]$, where $\phi$ represents the trainable parameters of the reward model, $\sigma$ is the sigmoid function, and each sample $(x, y_a, y_b)$ denotes a preference for $y_a$ over $y_b$ given input $x$.
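As a concrete illustration, here is a minimal PyTorch sketch of this negative log-likelihood objective. The function name `reward_model_loss` and the toy score values are illustrative assumptions; the scalar scores $r_\phi(x, y)$ are assumed to have already been produced by the reward model for each response in the pair.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the Bradley-Terry preference model:
    # L(phi) = -E[ log sigma( r_phi(x, y_a) - r_phi(x, y_b) ) ]
    # logsigmoid is used instead of log(sigmoid(.)) for numerical stability.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy batch of two preference pairs (scores are illustrative):
# in the first pair the preferred response already scores higher,
# in the second it scores lower, so the second pair dominates the loss.
r_a = torch.tensor([3.2, -0.5])   # r_phi(x, y_preferred)
r_b = torch.tensor([1.5,  0.4])   # r_phi(x, y_rejected)
print(reward_model_loss(r_a, r_b).item())
```

Minimizing this quantity pushes the score gap $r_\phi(x, y_a) - r_\phi(x, y_b)$ to be positive and large, which is exactly what maximizing the Bradley-Terry preference probability requires.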

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Intuition of the Ranking Loss Function in RLHF
Reward Model Training via Ranking Loss Minimization
Reward Model Loss as Negative Log-Likelihood
Flexibility of Ranking Loss Functions in Reward Model Training
Learning-to-Rank Approaches for Human Preference Modeling
An AI team is training a system to learn from human preferences. They have a dataset where, for a given input x, humans consistently prefer response y_preferred over response y_rejected. After training, they test two different scoring models, Model A and Model B, on this pair. The models produce the following scores:
- Model A: score(x, y_preferred) = 3.2, score(x, y_rejected) = 1.5
- Model B: score(x, y_preferred) = -0.5, score(x, y_rejected) = -2.0
Based on these scores, which statement accurately evaluates the models' performance on this specific example?
A reward model is being trained to learn human preferences by minimizing a ranking loss function. This function penalizes the model when the score it assigns to a human-preferred response is not higher than the score for a less-preferred response. Given the same prompt, which of the following scoring outcomes for a preferred/less-preferred pair would incur a penalty from the loss function?
Evaluating Reward Model Score Outputs
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
Pair-wise Ranking Loss Formula for RLHF Reward Model
Simplified Notation for Preference Probability Models
Reward Model Loss as Negative Log-Likelihood
Empirical Reward Model Loss Formula using Bradley-Terry Model
A system for evaluating generated text uses a scalar scoring function, r(input, output), to assign a numerical score to each potential output. For a given input, 'Output A' receives a score of 2.0, and 'Output B' receives a score of -0.2. The system models the probability that one output is preferred over another using the sigmoid of the difference between their scores. Based on this model, what is the approximate probability that 'Output A' is preferred over 'Output B'?
Impact of Score Transformation on Preference Probabilities
Derivation of the Bradley-Terry Preference Formula
Omission of Parameter Superscript in Probability Notation
A preference model calculates the probability that output Y_a is preferred over output Y_b by applying the sigmoid function to the difference in their scalar scores, score(Y_a) - score(Y_b). If the initial scores for Y_a and Y_b result in a preference probability greater than 50% but less than 100%, which of the following transformations to the scores is guaranteed to leave this probability unchanged?
Learn After
Pair-wise Ranking Loss Formula for RLHF Reward Model
Empirical Reward Model Loss Formula using Bradley-Terry Model
A reward model is trained to learn human preferences by minimizing the following loss function, which is an expectation over a preference dataset $\mathcal{D}$: $L(\phi) = -\mathbb{E}_{(x, y_a, y_b)\sim\mathcal{D}}\left[\log \sigma\bigl(r_\phi(x, y_a) - r_\phi(x, y_b)\bigr)\right]$.
In this dataset, $y_a$ represents a response preferred over response $y_b$ for a given input $x$. What is the primary effect of successfully minimizing this loss function on the model's behavior?
Reward Model Training Diagnosis
Composition of Reward Model Parameters (ϕ)
Approximating Expected Loss with Empirical Loss
Empirical Reward Model Loss Formula
Impact of Prediction Confidence on Reward Model Loss