Learn Before
  • Application of Segment-Based Total Reward in Policy Training

Theory

Goodhart's Law in Reward Modeling

Goodhart's Law provides a theoretical explanation for the overoptimization problem. The law states that when a measure, such as a reward score, is elevated to become an optimization target, it ceases to be a reliable indicator of the quality it was intended to represent.
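The dynamic can be seen concretely in a toy simulation (a purely hypothetical sketch, not any particular training setup): a greedy optimizer pushes up a proxy reward, here keyword density, while the true quality the proxy was meant to track peaks at a moderate density and then collapses.

```python
import random

random.seed(0)

# Toy model of Goodhart's Law. The proxy reward is keyword density,
# the measure being optimized. True quality (illustrative) peaks at
# a moderate density of 0.5 and falls to zero when the text
# degenerates into a pure keyword list (density 1.0).
def proxy_reward(density):
    return density

def true_quality(density):
    return density * (1.0 - density)

# Greedy hill-climbing on the proxy: accept any candidate that
# raises the proxy score, ignoring true quality entirely.
density = 0.1
for step in range(50):
    candidate = min(1.0, density + random.uniform(0.0, 0.05))
    if proxy_reward(candidate) > proxy_reward(density):
        density = candidate

final_proxy = proxy_reward(density)
final_quality = true_quality(density)
# The proxy keeps improving, but once density passes the 0.5
# optimum, true quality degrades: the measure has stopped being
# a reliable indicator of the quality it was meant to represent.
```

Nothing here depends on language models; the point is structural. Any time selection pressure is applied to the measure rather than the target, the correlation between them breaks down.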

Updated 2026-05-03

Contributors: Gemini AI, from Google

References


  • Reference of Foundations of Large Language Models Course

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • A policy model is being trained to generate summaries. Each generated summary is broken down into three sequential segments: beginning, middle, and end. A reward score is calculated for each segment, and the total reward for the summary is the simple sum of these three scores. This total reward is then used to update the model. During testing, it is observed that the model consistently generates summaries with a strong beginning but a weak, often incoherent, end. Which of the following adjustments to the training process would be most effective at specifically addressing this issue?

  • Analysis of Aggregated Reward Signals in Model Training

  • Overoptimization Problem in Reward Modeling

  • Goodhart's Law in Reward Modeling

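The first Related item above describes summing per-segment scores into a total reward. A minimal sketch of that aggregation, in which the function, the scores, and the weights are all hypothetical, shows one adjustment the question hints at: reweighting segments so a weak ending costs more.

```python
# Hypothetical segment-based total reward: a summary is split into
# beginning, middle, and end segments, each scored separately, and
# the policy update uses the aggregate score.
def total_reward(segment_rewards, weights=None):
    """Sum per-segment scores; optional weights let later segments
    count more, counteracting a model that front-loads quality."""
    if weights is None:
        weights = [1.0] * len(segment_rewards)
    return sum(w * r for w, r in zip(weights, segment_rewards))

# Illustrative scores for a summary with a strong beginning and a
# weak, incoherent end.
scores = {"beginning": 0.9, "middle": 0.6, "end": 0.2}

uniform = total_reward(list(scores.values()))
# Upweighting the end segment makes its score matter more to the
# total, so improving the ending yields a larger reward gain.
weighted = total_reward(list(scores.values()), weights=[1.0, 1.0, 2.0])
```

Under a simple unweighted sum, a point of quality at the beginning is worth exactly as much as a point at the end, so the policy has no incentive to fix the ending specifically; reweighting changes that trade-off.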
Learn After
  • An AI development team trains a language model to generate helpful summaries of news articles. They create a reward system that gives high scores to summaries that contain a high density of keywords from the original article. Initially, the model's summaries improve. However, after extensive training, the team observes that the model produces summaries that are just lists of keywords, making them unreadable and unhelpful, even though they consistently achieve near-perfect reward scores. Which of the following principles best explains this outcome?

  • Critique of a Reward Model for Chatbot Helpfulness

  • Analysis of Reward Model Failure

© 1Cademy 2026