Case Study

Diagnosing Reward Model Score Inflation

A machine learning team is training a reward model. Their loss function is designed solely to maximize the score difference between a preferred response (y_a) and a non-preferred response (y_b). After many training iterations, they observe that while the model correctly identifies the preferred response most of the time, the actual reward scores r(x, y) are becoming extremely large (e.g., +500 for the preferred response and +498 for the non-preferred one). Why does a loss function focused only on the score difference permit this behavior, and what specific mathematical term, involving both r(x, y_a) and r(x, y_b), could be added to the loss function to penalize this score inflation without discouraging a large difference between the scores?
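The mechanism behind the question can be sketched numerically. The snippet below is a minimal NumPy illustration, not the team's actual training code: it uses a standard pairwise (Bradley-Terry style) loss, -log σ(r(x, y_a) - r(x, y_b)), which is invariant to adding the same constant to both scores, and a hypothetical L2 penalty on the raw scores (weight lam chosen arbitrarily) as one example of a magnitude-penalizing term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_loss(r_a, r_b):
    # Difference-only pairwise loss: depends on r_a - r_b alone,
    # so shifting both scores by the same constant changes nothing.
    return -np.log(sigmoid(r_a - r_b))

def regularized_loss(r_a, r_b, lam=0.01):
    # Hypothetical L2 penalty on the raw scores: lam * (r_a^2 + r_b^2).
    # It pulls both scores toward zero while leaving the preference
    # margin r_a - r_b free to stay large relative to the penalty.
    return pairwise_loss(r_a, r_b) + lam * (r_a ** 2 + r_b ** 2)

# The difference-only loss cannot distinguish (2, 0) from (502, 500)...
assert np.isclose(pairwise_loss(2.0, 0.0), pairwise_loss(502.0, 500.0))
# ...but the regularized loss penalizes the inflated pair.
assert regularized_loss(502.0, 500.0) > regularized_loss(2.0, 0.0)
```

The two assertions make the case study concrete: the inflated pair (+502, +498-style magnitudes) is indistinguishable under the difference-only loss but clearly worse once a score-magnitude term is added.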


Updated 2025-10-08
