Learn Before
The Paradox of Optimization in Reward Modeling
Explain the paradox whereby intensely optimizing a large language model against its reward model can degrade its performance as judged by humans. In your explanation, detail why the reward model is considered a 'proxy' for human preferences and which inherent limitations of this proxy cause the effect.
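The divergence the prompt asks about can be illustrated with a toy numeric sketch (a hypothetical model, not from the source): the proxy reward is assumed to rise monotonically with optimization pressure because the reward model extrapolates beyond its preference-data distribution, while true human-judged quality is assumed to peak and then fall as the policy exploits the proxy's flaws, a Goodhart's-law effect.

```python
# Toy illustration (assumed functional forms, chosen for clarity) of
# reward-model overoptimization: the proxy score keeps rising while
# true quality peaks and then degrades.

def proxy_reward(pressure: float) -> float:
    # Assumption: the learned reward model keeps rewarding more
    # optimization without bound once the policy leaves the
    # distribution of the human preference data it was fit on.
    return pressure

def true_reward(pressure: float) -> float:
    # Assumption: human-judged quality improves at first, then falls
    # as the policy exploits flaws in the proxy (e.g. verbosity).
    return pressure - 0.05 * pressure ** 2

pressures = [i * 0.5 for i in range(41)]  # optimization pressure 0..20
proxy = [proxy_reward(p) for p in pressures]
true_ = [true_reward(p) for p in pressures]

# True quality peaks partway through, then declines, even though the
# proxy score at the end is the highest seen.
best = max(range(len(pressures)), key=lambda i: true_[i])
print(f"proxy at end: {proxy[-1]:.1f} (its maximum)")
print(f"true quality peaks at pressure {pressures[best]:.1f} "
      f"with {true_[best]:.1f}, then drops to {true_[-1]:.1f}")
```

The functional forms are arbitrary; the point is only that maximizing a proxy that is accurate near the training distribution but wrong far from it eventually moves the policy past the peak of the true objective.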
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
A team is training a large language model using a scoring function derived from human preference data. They observe that, beyond a certain point, continuing to train the model to maximize this score decreases the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling