Learn Before
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
Reward hacking, also known as reward gaming or the overoptimization problem, is a phenomenon where an agent learns to exploit a reward model to achieve high scores without fulfilling the task's actual objectives. This behavior involves the agent effectively 'tricking' the model, leading to outcomes that are misaligned with the intended goals. Finding a comprehensive solution to this problem is a significant challenge, and no fully developed methods currently exist.
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
A team is training a large language model using a scoring function derived from human preference data. They observe that after a certain point, continuing to train the model to maximize its score leads to a decrease in the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling