Learn Before
The Paradox of Optimization in Reward Modeling
Explain the paradox whereby intensely optimizing a large language model against its reward model can degrade its performance as judged by humans. In your explanation, detail why the reward model is considered a 'proxy' for human preferences and which inherent limitations of this proxy cause the effect.
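The divergence the prompt asks about can be illustrated with a toy numeric sketch (a hypothetical model, not from the source): the proxy reward is assumed to rise monotonically with optimization pressure because the reward model extrapolates beyond its preference-data distribution, while true human-judged quality is assumed to peak and then fall as the policy exploits the proxy's flaws, a Goodhart's-law effect.

```python
# Toy illustration (assumed functional forms, chosen for clarity) of
# reward-model overoptimization: the proxy score keeps rising while
# true quality peaks and then degrades.

def proxy_reward(pressure: float) -> float:
    # Assumption: the learned reward model keeps rewarding more
    # optimization without bound once the policy leaves the
    # distribution of the human preference data it was fit on.
    return pressure

def true_reward(pressure: float) -> float:
    # Assumption: human-judged quality improves at first, then falls
    # as the policy exploits flaws in the proxy (e.g. verbosity).
    return pressure - 0.05 * pressure ** 2

pressures = [i * 0.5 for i in range(41)]  # optimization pressure 0..20
proxy = [proxy_reward(p) for p in pressures]
true_ = [true_reward(p) for p in pressures]

# True quality peaks partway through, then declines, even though the
# proxy score at the end is the highest seen.
best = max(range(len(pressures)), key=lambda i: true_[i])
print(f"proxy at end: {proxy[-1]:.1f} (its maximum)")
print(f"true quality peaks at pressure {pressures[best]:.1f} "
      f"with {true_[best]:.1f}, then drops to {true_[-1]:.1f}")
```

The functional forms are arbitrary; the point is only that maximizing a proxy that is accurate near the training distribution but wrong far from it eventually moves the policy past the peak of the true objective.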
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
A team is training a large language model using a scoring function derived from human preference data. They observe that, beyond a certain point, continuing to train the model to maximize this score decreases the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling