Case Study

Diagnosing Reward Model Score Inflation

A machine learning team is training a reward model. Their loss function is designed solely to maximize the score difference between a preferred response (y_a) and a non-preferred response (y_b). After many training iterations, they observe that while the model correctly identifies the preferred response most of the time, the actual reward scores r(x, y) are becoming extremely large (e.g., +500 for the preferred response and +498 for the non-preferred one). Why does a loss function focused only on the score difference permit this behavior, and what specific mathematical term, involving both r(x, y_a) and r(x, y_b), could be added to the loss function to penalize this score inflation without discouraging a large difference between the scores?
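The mechanism behind the question can be sketched numerically. The snippet below is a minimal NumPy illustration, not the team's actual training code: it uses a standard pairwise (Bradley-Terry style) loss, -log σ(r(x, y_a) - r(x, y_b)), which is invariant to adding the same constant to both scores, and a hypothetical L2 penalty on the raw scores (weight lam chosen arbitrarily) as one example of a magnitude-penalizing term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_loss(r_a, r_b):
    # Difference-only pairwise loss: depends on r_a - r_b alone,
    # so shifting both scores by the same constant changes nothing.
    return -np.log(sigmoid(r_a - r_b))

def regularized_loss(r_a, r_b, lam=0.01):
    # Hypothetical L2 penalty on the raw scores: lam * (r_a^2 + r_b^2).
    # It pulls both scores toward zero while leaving the preference
    # margin r_a - r_b free to stay large relative to the penalty.
    return pairwise_loss(r_a, r_b) + lam * (r_a ** 2 + r_b ** 2)

# The difference-only loss cannot distinguish (2, 0) from (502, 500)...
assert np.isclose(pairwise_loss(2.0, 0.0), pairwise_loss(502.0, 500.0))
# ...but the regularized loss penalizes the inflated pair.
assert regularized_loss(502.0, 500.0) > regularized_loss(2.0, 0.0)
```

The two assertions make the case study concrete: the inflated pair (+502, +498-style magnitudes) is indistinguishable under the difference-only loss but clearly worse once a score-magnitude term is added.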


Updated 2025-10-08
