1Cademy - Evaluating a Flawed Solution to Reward Hacking

Learn Before

Explaining Overoptimization with Goodhart's Law

Essay

A company trains a language model to generate creative marketing slogans. To measure creativity, the company builds a reward model that assigns high scores to slogans containing words from a predefined list of 100 'persuasive' adjectives. The language model quickly learns to pack its slogans with these adjectives and achieves near-perfect scores, but the resulting text is often nonsensical and ungrammatical. A junior developer proposes fixing this by expanding the list to 500 'persuasive' adjectives. Evaluate this proposed solution: is it likely to solve the underlying problem? Justify your answer by explaining the relationship between a metric and a target in this context.
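To make the scenario concrete, here is a minimal Python sketch of what such a keyword-based reward might look like. The scenario does not say how the reward model computes its score, so everything here is an assumption for illustration: the name `keyword_reward`, the five-word stand-in for the adjective list, the two example slogans, and the choice to score a slogan by the fraction of its words that appear on the list.

```python
# Hypothetical stand-in for the scenario's reward model; not the company's actual method.
# Illustrative five-word subset standing in for the list of 'persuasive' adjectives.
PERSUASIVE_ADJECTIVES = {"amazing", "bold", "fresh", "powerful", "ultimate"}


def keyword_reward(slogan, adjectives=PERSUASIVE_ADJECTIVES):
    """Return the fraction of words in the slogan that appear on the adjective list.

    Assumption: the score depends only on list membership, never on grammar or meaning.
    """
    # Lowercase each word and strip surrounding punctuation before the lookup.
    words = [w.strip(".,!?").lower() for w in slogan.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in adjectives)
    return hits / len(words)


# A readable slogan versus one that simply stacks listed adjectives.
sensible = "Fresh coffee for bold mornings."
stuffed = "Amazing bold fresh powerful ultimate amazing bold!"

sensible_score = keyword_reward(sensible)
stuffed_score = keyword_reward(stuffed)

print(f"sensible: {sensible_score:.2f}")
print(f"stuffed:  {stuffed_score:.2f}")
```

Under these assumptions the adjective-stuffed slogan receives the higher score. Passing a larger set as `adjectives` changes which words count as hits; the scoring rule itself stays the same.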


Updated 2025-10-06
