Short Answer

Impact of Prediction Confidence on Reward Model Loss

A reward model is trained by minimizing the negative log-likelihood of human preferences, using the loss function $\mathcal{L}_r(\phi) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\sim\mathcal{D}_r} \left[\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})\right]$.
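The question does not state how $\text{Pr}_{\phi}$ is parameterized; in reward-model training it is commonly the Bradley-Terry form, a sigmoid of the difference between the scalar rewards of the two responses. A minimal sketch under that assumption (function names are illustrative):

```python
import math

def bradley_terry_prob(r_a, r_b):
    # Probability that response a is preferred over b,
    # given scalar rewards r_a = r(x, y_a) and r_b = r(x, y_b).
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def per_example_loss(r_a, r_b):
    # Negative log-likelihood of the human-preferred response a.
    return -math.log(bradley_terry_prob(r_a, r_b))
```

With equal rewards the model assigns probability 0.5 and the loss is $\log 2$; the loss shrinks as the reward margin in favor of the preferred response grows.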

Consider a single data point $(\mathbf{x}, \mathbf{y}_a, \mathbf{y}_b)$ where human annotators preferred response $\mathbf{y}_a$ over $\mathbf{y}_b$. Compare the following two scenarios:

  • Scenario A: The model is highly confident and predicts the probability of the correct preference, $\text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$, as 0.9.
  • Scenario B: The model is less confident and predicts the same probability as 0.6.

In which scenario is the loss contribution from this single data point higher? Explain your reasoning by relating the model's predicted probability to the value of the negative log-likelihood loss.
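Plugging the two stated probabilities into the per-example loss $-\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$ makes the comparison concrete:

```python
import math

# Loss contribution -log(p) for each scenario's predicted probability
loss_a = -math.log(0.9)   # Scenario A: highly confident in the correct preference
loss_b = -math.log(0.6)   # Scenario B: less confident

print(f"Scenario A loss: {loss_a:.3f}")  # 0.105
print(f"Scenario B loss: {loss_b:.3f}")  # 0.511
```

Because $-\log p$ decreases monotonically in $p$ on $(0, 1]$, the less confident prediction (Scenario B) contributes the larger loss, which is exactly the gradient signal that pushes the model toward higher confidence on correctly ranked pairs.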


Updated 2025-10-08


Tags

  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science