Goodhart's Law in Reward Modeling
Goodhart's Law provides a theoretical explanation for the overoptimization problem: once a measure, such as a learned reward score, is made the optimization target, it ceases to be a reliable indicator of the quality it was intended to represent. In reward modeling, the reward model is only a proxy for human preferences, so a policy trained to maximize it aggressively will eventually exploit the proxy's flaws rather than improve genuine output quality.
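To make the failure mode concrete, here is a minimal sketch in Python. It is illustrative only: the one-dimensional policy, the functional forms of proxy_reward and true_quality, and the hill-climbing loop are all assumptions chosen to expose the effect, not an implementation from the text.

```python
# Toy illustration of Goodhart's Law in reward modeling.
# The "policy" is reduced to a single number: the fraction of a summary's
# tokens that are keywords. Both scoring functions below are assumed
# forms chosen for illustration, not real reward models.

def proxy_reward(keyword_density: float) -> float:
    """Proxy metric used as the optimization target: keyword density alone."""
    return keyword_density


def true_quality(keyword_density: float) -> float:
    """Assumed underlying quality: keywords help coverage, but a summary
    that is mostly keywords becomes unreadable. Peaks at density 0.5."""
    return 4.0 * keyword_density * (1.0 - keyword_density)


def optimize_proxy(steps: int = 9, step_size: float = 0.1) -> None:
    """Greedy hill climbing on the proxy reward."""
    density = 0.1  # initial policy: few keywords, readable prose
    for step in range(steps):
        # Under the proxy, adding keywords always scores higher.
        density = min(1.0, density + step_size)
        print(f"step {step}: density={density:.1f}  "
              f"proxy={proxy_reward(density):.2f}  "
              f"true={true_quality(density):.2f}")


if __name__ == "__main__":
    optimize_proxy()
```

Running the sketch, the proxy score climbs monotonically toward its maximum while the assumed true quality peaks near a balanced keyword density and then collapses toward zero: the reward keeps improving even as the output degenerates, which is exactly the divergence Goodhart's Law predicts.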