1Cademy - Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)

Learn Before

Reward Model as an Imperfect Environment Proxy

Problem

Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)

Reward hacking, also known as reward gaming or the overoptimization problem, is a phenomenon where an agent learns to exploit a reward model to achieve high scores without fulfilling the task's actual objectives. This behavior involves the agent effectively 'tricking' the model, leading to outcomes that are misaligned with the intended goals. Finding a comprehensive solution to this problem is a significant challenge, and no fully developed methods currently exist.

Updated 2025-10-07

Contributors are: