Essay

Evaluating a Flawed Solution to Reward Hacking

A company trains a language model to generate creative marketing slogans. To measure creativity, it builds a reward model that gives high scores to slogans containing words from a predefined list of 100 'persuasive' adjectives. The model quickly learns to generate slogans packed with these adjectives and achieves near-perfect scores, but the resulting text is often nonsensical and grammatically incorrect. A junior developer suggests fixing this by expanding the list to 500 'persuasive' adjectives. Evaluate this proposed solution. Will it likely solve the underlying problem? Justify your answer by explaining the relationship between a metric and a target in this context.
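
The scenario can be made concrete with a minimal sketch. The prompt does not say how the reward is computed, so this assumes the score is the fraction of a slogan's words that appear in the list. The names `keyword_reward`, `PERSUASIVE_ADJECTIVES`, and `EXPANDED_ADJECTIVES`, the five-word stand-in list, and both example slogans are hypothetical.

```python
import re

# Hypothetical stand-in for the 100-adjective list (the real list is not given).
PERSUASIVE_ADJECTIVES = {"amazing", "bold", "fresh", "powerful", "ultimate"}

# Hypothetical stand-in for the junior developer's expanded list.
EXPANDED_ADJECTIVES = PERSUASIVE_ADJECTIVES | {"radiant", "daring", "iconic"}


def keyword_reward(slogan, vocabulary):
    """Assumed scoring rule: fraction of the slogan's words found in the vocabulary."""
    words = re.findall(r"[a-z']+", slogan.lower())
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


# A readable slogan versus a string that only stuffs listed adjectives.
sensible = "A fresh start for your morning coffee"
stuffed = "Amazing bold fresh powerful ultimate amazing bold"

for name, vocabulary in [("original", PERSUASIVE_ADJECTIVES),
                         ("expanded", EXPANDED_ADJECTIVES)]:
    print(name,
          round(keyword_reward(sensible, vocabulary), 3),
          round(keyword_reward(stuffed, vocabulary), 3))
```

Under these assumptions, the score depends only on word membership, never on grammar or meaning, so the sketch can be used to compare how each list ranks the two slogans.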

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science