Consequences of Modifying the PPO Objective Function
A researcher is training a language model using an objective that combines a clipped surrogate term (which encourages high reward) with a policy-divergence penalty controlled by a coefficient β. If the researcher sets β to zero for the entire training run, what are the likely consequences for the model's generated text? Describe both a potential short-term benefit and a significant long-term drawback.
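For context (not part of the original card): the objective described here is the standard PPO-with-KL-penalty form used in RLHF, roughly

    L(θ) = E_t[ min( r_t(θ)·Â_t , clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ] − β · KL(π_θ ‖ π_ref),

where r_t(θ) is the probability ratio between the current policy and the rollout policy, Â_t is the advantage estimate, and π_ref is the frozen reference model. Setting β = 0 deletes the second term, so nothing anchors the policy to the reference distribution. A minimal PyTorch sketch of this combined loss follows; all names (ppo_objective, logp_new, logp_ref, and so on) are illustrative, not taken from any particular library:

```python
import torch

def ppo_objective(logp_new, logp_old, logp_ref, advantages,
                  beta=0.1, clip_eps=0.2):
    """Clipped surrogate objective with a policy-divergence penalty.

    With beta=0 the penalty term vanishes: the policy is free to chase
    reward, which can raise scores quickly (a short-term benefit) but
    also lets it drift arbitrarily far from the reference model
    (long-term risk of degenerate, reward-hacked text).
    """
    # Probability ratio between the current and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the pessimistic (elementwise min) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Crude per-token KL estimate against the frozen reference policy.
    kl_penalty = (logp_new - logp_ref).mean()

    # Objective to maximize; in practice one minimizes its negative.
    return surrogate - beta * kl_penalty
```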
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
Impact of Penalty Coefficient on LLM Fine-Tuning