Case Study

Diagnosing Reward Model Failure

An AI development team is training a language model to be a helpful assistant. They first create a dataset where human labelers compare pairs of model-generated responses to the same prompt and choose the better one. This comparison data is used to train a separate 'scoring' model, whose goal is to predict which response a human would prefer. Finally, the main language model is trained to generate responses that achieve a high score from this scoring model.

After the process is complete, the team observes that the final language model often produces outputs that are grammatically correct and sound confident, but are factually incorrect or nonsensical. However, these incorrect responses consistently receive high scores from the scoring model.

What is the most likely flaw in the training of the 'scoring' model that would lead to this specific outcome?
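To make the setup concrete, here is a minimal sketch of the scoring model's training objective and one way the described failure can arise. The pairwise loss is the standard Bradley-Terry formulation used for reward models; the two features ("sounds confident", "is factually correct") and the linear scorer are hypothetical illustrations, not the team's actual model. The point: if confident tone and correctness are confounded in the comparison data, a scorer that keys on tone alone fits the data well, yet rewards confident-but-wrong text at optimization time.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(s_chosen, s_rejected):
    # Bradley-Terry negative log-likelihood: the scoring model is trained
    # to rank the human-preferred response above the rejected one.
    return -math.log(sigmoid(s_chosen - s_rejected))

# Hypothetical linear scorer over two hand-made features:
# features[0] = "sounds confident", features[1] = "is factually correct".
def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# Comparison data in which the preferred response is both confident and
# correct, so the two features are perfectly confounded.
pairs = [
    ([1.0, 1.0], [0.0, 0.0]),
    ([1.0, 1.0], [0.0, 0.0]),
]

# A scorer that latched onto confidence alone fits this data well...
w_confidence_only = [4.0, 0.0]
avg_loss = sum(
    pairwise_loss(score(w_confidence_only, chosen),
                  score(w_confidence_only, rejected))
    for chosen, rejected in pairs
) / len(pairs)

# ...but at RL time it ranks confident-but-wrong text above
# hedged-but-correct text, which is exactly the observed failure.
confident_wrong = score(w_confidence_only, [1.0, 0.0])
hedged_correct = score(w_confidence_only, [0.0, 1.0])
```

When the policy is then optimized against this scorer, it is rewarded for the spurious surface feature the scorer learned, not for the correctness the labelers intended to reward.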


Updated 2025-10-10


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science