Stabilizing an Underdetermined Reward Model
A machine learning engineer observes that their reward model, trained to score pairs of text completions so that the preferred completion receives the higher reward, is learning the correct relative rankings. However, the absolute values of the rewards assigned to the completions grow excessively large during training, leading to instability. Explain why this phenomenon is a classic sign of an underdetermined problem in this context, and describe how adding a penalty for large reward values to the training objective mitigates this issue.
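For concreteness, a minimal sketch of the two-term objective in PyTorch is given below (the function name, the dummy reward values, and the lambda_reg setting are illustrative assumptions, not part of the question; penalizing the sum of squared rewards is one common form of the penalty). The ranking term depends only on the difference r_chosen - r_rejected, so adding the same constant to both rewards leaves it unchanged: the absolute scale is a flat direction of the loss and is therefore underdetermined. The L2 penalty breaks that invariance by anchoring scores near zero.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(r_chosen, r_rejected, lambda_reg=0.01):
    """Bradley-Terry-style ranking loss with an L2 penalty on raw rewards.

    The ranking term depends only on (r_chosen - r_rejected), so it is
    invariant to shifting both rewards by the same constant: the absolute
    scale is underdetermined. The penalty term removes that degeneracy
    by making large-magnitude rewards strictly more costly.
    """
    # Ranking term: -log sigmoid(r_chosen - r_rejected).
    ranking = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Regularizer: penalize large absolute reward values.
    penalty = lambda_reg * (r_chosen.pow(2) + r_rejected.pow(2)).mean()
    return ranking + penalty

# Illustrative usage with hypothetical reward scores: both pairs have the
# same reward gap of 1, so the ranking term treats them identically.
r_chosen = torch.tensor([2.0, 1_000_000.0])
r_rejected = torch.tensor([1.0, 999_999.0])
print(pairwise_loss(r_chosen, r_rejected, lambda_reg=0.0))   # penalty off
print(pairwise_loss(r_chosen, r_rejected, lambda_reg=0.01))  # penalty on
```

With lambda_reg > 0, drifting toward huge reward magnitudes strictly increases the loss, so gradient descent no longer wanders along the degenerate direction.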
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Reward Model Instability
A team is training a reward model using a loss function that considers only the relative ranking between two responses (i.e., that a preferred response gets a higher score than a dispreferred one). They observe that while the model learns the correct rankings, the absolute reward scores it assigns can grow uncontrollably large (e.g., scores of +1,000,000 and +999,999 are treated the same as +2 and +1). To fix this, they add a regularization term that penalizes the squared sum of the two rewards in each training pair. Which statement best analyzes how this specific regularization addresses the problem?