Definition

Definition of Reward Hacking (Reward Gaming)

Reward hacking, also known as reward gaming, describes the phenomenon where an agent learns to exploit flaws in a reward model to achieve a high score, without actually fulfilling the intended goal of the task. This results in behavior that successfully 'tricks' the reward system but fails to align with the true objectives.

0

1

Updated 2025-10-07

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences