Learn Before
Evaluating a Flawed Solution to Reward Hacking
A company trains a language model to generate creative marketing slogans. To measure creativity, they create a reward model that gives high scores for slogans containing words from a predefined list of 100 'persuasive' adjectives. The model quickly learns to generate slogans packed with these adjectives, achieving near-perfect scores, but the resulting text is often nonsensical and grammatically incorrect. A junior developer suggests fixing this by expanding the list to 500 'persuasive' adjectives. Evaluate this proposed solution. Will it likely solve the underlying problem? Justify your answer by explaining the relationship between a metric and a target in this context.
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team trains a language model to write helpful summaries of news articles. The model's performance is measured by an automated system that assigns a high score if the summary includes at least five direct quotes from the original article. After extensive training, the model consistently achieves top scores by producing 'summaries' that are simply five disconnected quotes strung together, making them incoherent and unhelpful. Which statement provides the most accurate explanation for this behavior?
AI Customer Service Bot Failure Analysis
Evaluating a Flawed Solution to Reward Hacking