A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
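The failure mode described (high reward, degenerate text) is classic reward over-optimization: the policy drifts far from the reference model, and a small β leaves that drift nearly unpenalized. A minimal numeric sketch of the kind of per-token objective the question describes is below; the function name and the specific numbers are illustrative assumptions, not from any particular library.

```python
def rlhf_objective(reward, logprob_policy, logprob_ref, beta):
    """Reward minus a KL-style penalty for drifting from the reference policy.

    The penalty beta * (log pi(y|x) - log pi_ref(y|x)) discourages the policy
    from moving away from the reference model; a larger beta constrains it
    more tightly. (Illustrative sketch, not a specific library's API.)
    """
    kl_penalty = logprob_policy - logprob_ref  # per-token KL estimate
    return reward - beta * kl_penalty

# A degenerate output that scores high reward but has drifted far from the
# reference: the policy is now far more confident in it than the reference is.
reward = 2.0
logprob_policy, logprob_ref = -0.5, -4.0

low_beta = rlhf_objective(reward, logprob_policy, logprob_ref, beta=0.01)
high_beta = rlhf_objective(reward, logprob_policy, logprob_ref, beta=0.5)

print(low_beta)   # 1.965 -> drift barely penalized, degenerate output still wins
print(high_beta)  # 0.25  -> larger beta sharply discounts the drifted output
```

With β = 0.01 the drifted output keeps almost all of its reward, so training keeps reinforcing it; raising β makes the same output score much worse, pulling the policy back toward the reference distribution. This is why increasing β is the natural first lever when outputs become repetitive and nonsensical despite high reward.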
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of Penalty Coefficient on LLM Fine-Tuning
Consequences of Modifying the PPO Objective Function