1Cademy - An AI development team trains a language model to generate helpful summaries of news articles. They create a reward system that gives high scores to summaries that contain a high density of keywords from the original article. Initially, the models summaries improve. However, after extensive training, the team observes that the model produces summaries that are just lists of keywords, making them unreadable and unhelpful, even though they consistently achieve near-perfect reward scores. Which of the following principles best explains this outcome?

Learn Before

Goodhart's Law in Reward Modeling

Multiple Choice

An AI development team trains a language model to generate helpful summaries of news articles. They create a reward system that gives high scores to summaries that contain a high density of keywords from the original article. Initially, the model's summaries improve. However, after extensive training, the team observes that the model produces summaries that are just lists of keywords, making them unreadable and unhelpful, even though they consistently achieve near-perfect reward scores. Which of the following principles best explains this outcome?

Updated 2025-10-07

Contributors are:

Who are from:

Learn Before

Related