Learn Before
Impact of Prediction Confidence on Reward Model Loss
A reward model is trained by minimizing the negative log-likelihood of human preferences, based on the loss function $\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$, where $\sigma$ is the logistic (sigmoid) function.
Consider a single data point where human annotators preferred response $y_w$ over $y_l$. Compare the following two scenarios:
- Scenario A: The model is highly confident and predicts the probability of the correct preference, $P(y_w \succ y_l \mid x)$, as 0.9.
- Scenario B: The model is less confident and predicts that same probability as only 0.6.
In which scenario is the loss contribution from this single data point higher? Explain your reasoning by relating the model's predicted probability to the value of the negative log-likelihood loss.
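As a quick numerical check, here is a minimal Python sketch (the helper name `per_example_loss` is ours, not from the course) that evaluates the per-pair negative log-likelihood $-\log p$ for both scenarios:

```python
import math

def per_example_loss(p_correct: float) -> float:
    """Negative log-likelihood contribution of one preference pair,
    where p_correct is the model's predicted probability of the
    human-preferred ordering, sigma(r(x, y_w) - r(x, y_l))."""
    return -math.log(p_correct)

# Scenario A: confident, correct prediction -> small loss
print(per_example_loss(0.9))  # ~0.105
# Scenario B: less confident prediction -> larger loss
print(per_example_loss(0.6))  # ~0.511
```

Since $-\log p$ shrinks toward 0 as $p \to 1$ and grows without bound as $p \to 0$, the less confident prediction in Scenario B contributes the larger loss.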
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Pair-wise Ranking Loss Formula for RLHF Reward Model
Empirical Reward Model Loss Formula using Bradley-Terry Model
A reward model is trained to learn human preferences by minimizing the following loss function, which is an expectation over a preference dataset $\mathcal{D}$: $\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$
In this dataset, $y_w$ represents a response preferred over response $y_l$ for a given input $x$. What is the primary effect of successfully minimizing this loss function on the model's behavior? (A minimal numerical sketch of this effect follows the list below.)
Reward Model Training Diagnosis
Composition of Reward Model Parameters (ϕ)
Approximating Expected Loss with Empirical Loss
Empirical Reward Model Loss Formula
Impact of Prediction Confidence on Reward Model Loss
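To illustrate the related question above about what minimizing this loss does, here is a minimal sketch (the scalar setup and names are ours, for illustration only): treating the reward margin $m = r_\phi(x, y_w) - r_\phi(x, y_l)$ as a single trainable parameter shows that gradient descent on $-\log \sigma(m)$ pushes the margin up.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Treat the reward margin m = r_phi(x, y_w) - r_phi(x, y_l) as a
# single trainable scalar to see which way the loss pushes it.
m = 0.0   # model initially indifferent between y_w and y_l
lr = 1.0  # illustrative learning rate
for step in range(5):
    loss = -math.log(sigmoid(m))   # per-pair loss: -log sigma(m)
    grad = sigmoid(m) - 1.0        # d/dm of -log sigma(m)
    m -= lr * grad                 # gradient-descent step
    print(f"step {step}: loss={loss:.3f} -> margin m={m:.3f}")
# The margin grows each step: minimizing the loss trains r_phi to
# assign higher reward to the preferred response y_w than to y_l.
```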