Impact of Penalty Coefficient on LLM Fine-Tuning
Analyze the two scenarios described in the case study below. For each scenario, predict the most likely behavior of the fine-tuned language model and explain your reasoning by referring to the components of the combined objective function used in training.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
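The trade-off the question describes can be sketched numerically. Below is a minimal illustration (not part of the card itself) of a combined objective of the form reward minus β times a KL estimate against a frozen reference policy; the function name, the per-token KL approximation, and the numbers are all illustrative assumptions.

```python
# Minimal sketch of an RLHF-style combined objective:
# reward minus a beta-weighted KL penalty that discourages
# drifting away from a frozen reference policy.
# All names and values here are illustrative assumptions.

def combined_objective(reward, logprobs_policy, logprobs_ref, beta):
    """Per-sequence objective: reward - beta * KL-estimate.

    The KL term is approximated per token as
    log pi(y_t | x) - log pi_ref(y_t | x), summed over the sequence.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate

# A degenerate, repetitive completion can score a high reward while
# drifting far from the reference policy (large KL).
reward = 2.0
logp_policy = [-0.1, -0.1, -0.1]   # policy is very confident in its output
logp_ref = [-2.0, -2.0, -2.0]      # reference finds the same text unlikely

low_beta = combined_objective(reward, logp_policy, logp_ref, beta=0.01)
high_beta = combined_objective(reward, logp_policy, logp_ref, beta=0.5)

# With a small beta the drift barely dents the objective; with a larger
# beta the same drift is penalized heavily, discouraging reward hacking.
print(round(low_beta, 3))   # 1.943
print(round(high_beta, 3))  # -0.85
```

Under these assumptions, raising β makes the penalty term dominate whenever the policy strays far from the reference, which is why increasing β is the natural first response to high-reward but degenerate outputs.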
Impact of Penalty Coefficient on LLM Fine-Tuning
Consequences of Modifying the PPO Objective Function