A team is training a reward model using a loss function that only considers the relative ranking between two responses (i.e., that a preferred response gets a higher score than a dispreferred one). They observe that while the model learns the correct rankings, the absolute reward scores it assigns can grow uncontrollably large (e.g., scores of +1,000,000 and +999,999 are treated the same as +2 and +1). To fix this, they add a regularization term that penalizes the squared sum of the two rewards in each training pair. Which statement best analyzes how this specific regularization addresses the problem?
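The degeneracy the question describes can be seen in a few lines. Below is a minimal sketch (not the team's actual code) assuming a Bradley-Terry style logistic ranking loss and a penalty of the form λ(r_w + r_l)², where `lam` is a hypothetical regularization weight: shifting both rewards by a large constant leaves the ranking loss unchanged, but the squared-sum penalty grows, anchoring the scores near zero.

```python
import math

def pairwise_loss(r_w, r_l):
    # Bradley-Terry style ranking loss: -log(sigmoid(r_w - r_l)).
    # Depends only on the DIFFERENCE of the two rewards, so their
    # absolute scale is unconstrained.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

def regularized_loss(r_w, r_l, lam=0.01):
    # Add a penalty on the squared sum of the pair's rewards,
    # which pins the pair's mean score near zero.
    return pairwise_loss(r_w, r_l) + lam * (r_w + r_l) ** 2

# Both pairs have the same reward gap of 1, so the ranking loss is identical:
assert abs(pairwise_loss(2.0, 1.0) - pairwise_loss(1_000_000.0, 999_999.0)) < 1e-9

# The regularizer, however, heavily penalizes the drifted pair,
# breaking the shift-invariance of the pure ranking objective:
assert regularized_loss(2.0, 1.0) < regularized_loss(1_000_000.0, 999_999.0)
```

The key point the sketch illustrates: the ranking loss alone is invariant to adding any constant to both rewards, so the optimizer is free to let the scores drift; the squared-sum penalty removes that free direction without affecting which response is ranked higher.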
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Reward Model Instability
Stabilizing an Underdetermined Reward Model