Diagnosing Reward Model Score Inflation
A machine learning team is training a reward model. Their loss function is designed solely to maximize the score difference between a preferred response and a non-preferred response. After many training iterations, they observe that although the model correctly identifies the preferred response most of the time, the reward scores themselves are becoming extremely large (e.g., +500 for the preferred response and +498 for the non-preferred one). Why does a loss function focused only on the score difference permit this behavior, and what specific mathematical term, involving the reward scores of both responses, could be added to the loss function to penalize this score inflation without discouraging a large difference between the scores?
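As a starting point for the analysis, the scenario can be checked numerically. The sketch below is illustrative only: it assumes the difference-only loss takes the common pairwise form -log(sigmoid(difference)), which the question does not specify, and it tries a squared-sum penalty as one candidate regularizer. The function names `pairwise_loss` and `sum_penalty` are hypothetical.

```python
import math

def pairwise_loss(r_pref, r_nonpref):
    """Assumed difference-only loss: -log(sigmoid(r_pref - r_nonpref)).

    Written as log(1 + exp(-diff)) in a numerically stable form.
    """
    diff = r_pref - r_nonpref
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def sum_penalty(r_pref, r_nonpref):
    """Candidate regularizer (an assumption): the squared sum of the two scores."""
    return (r_pref + r_nonpref) ** 2

inflated = (500.0, 498.0)   # the scores observed in the scenario
centered = (1.0, -1.0)      # same difference of 2, centered on zero
wide = (10.0, -10.0)        # larger difference, still centered on zero

# The loss depends only on the difference, so adding a constant to both
# scores leaves it unchanged.
print(pairwise_loss(*inflated) == pairwise_loss(*centered))

# The squared-sum penalty separates the inflated pair from the centered one.
print(sum_penalty(*inflated), sum_penalty(*centered))

# Widening the gap symmetrically lowers the loss and incurs no penalty.
print(pairwise_loss(*wide) < pairwise_loss(*centered), sum_penalty(*wide))
```

Under these assumptions the difference-only loss cannot tell the inflated pair from the centered one, whereas the candidate penalty grows with the shared offset and stays at zero for any pair that is symmetric about zero, however far apart the two scores are.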
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of Regularization in Mitigating Reward Model Underdetermination
A reward model is being trained using a loss function that includes a regularization term to prevent its output scores from growing excessively large. The regularization component for a single pair of responses, , to an input, , is calculated as , where is the reward score. A higher value for this term results in a larger penalty. Given the following four pairs of reward scores, which pair would incur the largest penalty from this specific regularization term?
A reward model is being trained with a loss function that includes a regularization component. This component adds a penalty proportional to for a given input and a pair of responses . The goal of this penalty is to prevent reward scores from becoming excessively large. Consider two scenarios for the reward scores assigned to a pair of responses:
- Scenario 1: and
- Scenario 2: and
Based on the formula for the penalty, which of the following statements correctly analyzes the effect of the regularization in these two scenarios?