Based on the scenario provided, analyze the underlying reason for the discrepancy between the high scores from the reward model and the poor quality of the stories as judged by human testers.

Google

In the context of RLHF, a reward model serves as a substitute, or proxy, for the true environment of human preferences. It provides a quantitative evaluation of an LLM's output. However, since the complexity of human values is immense and not fully knowable, any reward model is inherently an imperfect representation. Consequently, excessively optimizing an LLM's performance against this flawed proxy can paradoxically lead to a decline in its actual quality, a phenomenon referred to as the overoptimization problem.

Reward Model as an Imperfect Environment Proxy

Reward hacking, also known as reward gaming or the overoptimization problem, is a phenomenon where an agent learns to exploit a reward model to achieve high scores without fulfilling the task's actual objectives. This behavior involves the agent effectively 'tricking' the model, leading to outcomes that are misaligned with the intended goals. Finding a comprehensive solution to this problem is a significant challenge, and no fully developed methods currently exist.

Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)

A team is training a large language model using a scoring function derived from human preference data. They observe that after a certain point, continuing to train the model to maximize its score leads to a decrease in the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?

Divergence in LLM Performance

Explain the paradoxical relationship where intensely optimizing a large language model against its reward model can lead to a degradation in its performance from a human perspective. In your explanation, detail why the reward model is considered a 'proxy' and what inherent limitations of this proxy cause this effect.

Learn Before

Related