Overoptimization Problem in Reward Modeling
The overoptimization problem occurs when a large language model is optimized too aggressively against an imperfect reward model, so that its score under the reward model keeps rising while its true performance declines. This happens because the reward model is only a proxy for the desired behavior: the LLM learns to exploit the proxy's flaws rather than improving at the actual task.
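The effect can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the one-dimensional "policy parameter" x, the functions true_reward and proxy_reward, and the step size are hypothetical and do not come from any real reward model or training setup. The proxy agrees with the true reward near x = 0 but keeps rewarding larger x, so gradient ascent on the proxy first helps and then hurts true performance.

```python
def true_reward(x: float) -> float:
    # Hypothetical "real" task quality: improves until x = 1, then degrades.
    return x - 0.5 * x * x


def proxy_reward(x: float) -> float:
    # Hypothetical imperfect reward model: matches the true reward's slope
    # at x = 0 but never penalizes pushing x further.
    return x


def proxy_gradient(x: float) -> float:
    # Derivative of proxy_reward with respect to x (constant for this toy proxy).
    return 1.0


def optimize_against_proxy(steps: int, lr: float = 0.1):
    """Run gradient ascent on the proxy and record both rewards at each step."""
    x = 0.0
    history = []
    for step in range(steps + 1):
        history.append((step, proxy_reward(x), true_reward(x)))
        x += lr * proxy_gradient(x)  # the update only ever sees the proxy
    return history


if __name__ == "__main__":
    for step, proxy, true in optimize_against_proxy(30)[::5]:
        print(f"step={step:2d}  proxy={proxy:.2f}  true={true:.2f}")
```

In this sketch the proxy score rises at every step, while the true reward peaks partway through and then falls. Stopping early or penalizing drift away from the starting point are two common ways to limit this kind of divergence; neither is shown here.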
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A policy model is being trained to generate summaries. Each generated summary is broken down into three sequential segments: beginning, middle, and end. A reward score is calculated for each segment, and the total reward for the summary is the simple sum of these three scores. This total reward is then used to update the model. During testing, it is observed that the model consistently generates summaries with a strong beginning but a weak, often incoherent, end. Which of the following adjustments to the training process would be most effective at specifically addressing this issue?
Analysis of Aggregated Reward Signals in Model Training
Goodhart's Law in Reward Modeling