Short Answer

Impact of Prediction Confidence on Reward Model Loss

A reward model is trained by minimizing the negative log-likelihood of human preferences, using the loss function $\mathcal{L}_r(\phi) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\sim\mathcal{D}_r} \left[\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})\right]$.
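The question does not state how $\text{Pr}_{\phi}$ is parameterized; in reward-model training it is commonly the Bradley-Terry form, a sigmoid of the difference between the scalar rewards of the two responses. A minimal sketch under that assumption (function names are illustrative):

```python
import math

def bradley_terry_prob(r_a, r_b):
    # Probability that response a is preferred over b,
    # given scalar rewards r_a = r(x, y_a) and r_b = r(x, y_b).
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def per_example_loss(r_a, r_b):
    # Negative log-likelihood of the human-preferred response a.
    return -math.log(bradley_terry_prob(r_a, r_b))
```

With equal rewards the model assigns probability 0.5 and the loss is $\log 2$; the loss shrinks as the reward margin in favor of the preferred response grows.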

Consider a single data point $(\mathbf{x}, \mathbf{y}_a, \mathbf{y}_b)$ where human annotators preferred response $\mathbf{y}_a$ over $\mathbf{y}_b$. Compare the following two scenarios:

  • Scenario A: The model is highly confident and predicts the probability of the correct preference, $\text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$, as 0.9.
  • Scenario B: The model is less confident and predicts the same probability as 0.6.

In which scenario is the loss contribution from this single data point higher? Explain your reasoning by relating the model's predicted probability to the value of the negative log-likelihood loss.
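Plugging the two stated probabilities into the per-example loss $-\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$ makes the comparison concrete:

```python
import math

# Loss contribution -log(p) for each scenario's predicted probability
loss_a = -math.log(0.9)   # Scenario A: highly confident in the correct preference
loss_b = -math.log(0.6)   # Scenario B: less confident

print(f"Scenario A loss: {loss_a:.3f}")  # 0.105
print(f"Scenario B loss: {loss_b:.3f}")  # 0.511
```

Because $-\log p$ decreases monotonically in $p$ on $(0, 1]$, the less confident prediction (Scenario B) contributes the larger loss, which is exactly the gradient signal that pushes the model toward higher confidence on correctly ranked pairs.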


Updated 2025-10-08


Tags

  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science