A team is training a reward model using a loss function that only considers the relative ranking between two responses (i.e., that a preferred response gets a higher score than a dispreferred one). They observe that while the model learns the correct rankings, the absolute reward scores it assigns can grow uncontrollably large (e.g., scores of +1,000,000 and +999,999 are treated the same as +2 and +1). To fix this, they add a regularization term that penalizes the squared sum of the two rewards in each training pair. Which statement best analyzes how this specific regularization addresses the problem?
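The degeneracy the question describes can be seen in a few lines. Below is a minimal sketch (not the team's actual code) assuming a Bradley-Terry style logistic ranking loss and a penalty of the form λ(r_w + r_l)², where `lam` is a hypothetical regularization weight: shifting both rewards by a large constant leaves the ranking loss unchanged, but the squared-sum penalty grows, anchoring the scores near zero.

```python
import math

def pairwise_loss(r_w, r_l):
    # Bradley-Terry style ranking loss: -log(sigmoid(r_w - r_l)).
    # Depends only on the DIFFERENCE of the two rewards, so their
    # absolute scale is unconstrained.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

def regularized_loss(r_w, r_l, lam=0.01):
    # Add a penalty on the squared sum of the pair's rewards,
    # which pins the pair's mean score near zero.
    return pairwise_loss(r_w, r_l) + lam * (r_w + r_l) ** 2

# Both pairs have the same reward gap of 1, so the ranking loss is identical:
assert abs(pairwise_loss(2.0, 1.0) - pairwise_loss(1_000_000.0, 999_999.0)) < 1e-9

# The regularizer, however, heavily penalizes the drifted pair,
# breaking the shift-invariance of the pure ranking objective:
assert regularized_loss(2.0, 1.0) < regularized_loss(1_000_000.0, 999_999.0)
```

The key point the sketch illustrates: the ranking loss alone is invariant to adding any constant to both rewards, so the optimizer is free to let the scores drift; the squared-sum penalty removes that free direction without affecting which response is ranked higher.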
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Reward Model Instability
Stabilizing an Underdetermined Reward Model