A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
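The failure mode described (high reward, degenerate text) is classic reward over-optimization: the policy drifts far from the reference model, and a small β leaves that drift nearly unpenalized. A minimal numeric sketch of the kind of per-token objective the question describes is below; the function name and the specific numbers are illustrative assumptions, not from any particular library.

```python
def rlhf_objective(reward, logprob_policy, logprob_ref, beta):
    """Reward minus a KL-style penalty for drifting from the reference policy.

    The penalty beta * (log pi(y|x) - log pi_ref(y|x)) discourages the policy
    from moving away from the reference model; a larger beta constrains it
    more tightly. (Illustrative sketch, not a specific library's API.)
    """
    kl_penalty = logprob_policy - logprob_ref  # per-token KL estimate
    return reward - beta * kl_penalty

# A degenerate output that scores high reward but has drifted far from the
# reference: the policy is now far more confident in it than the reference is.
reward = 2.0
logprob_policy, logprob_ref = -0.5, -4.0

low_beta = rlhf_objective(reward, logprob_policy, logprob_ref, beta=0.01)
high_beta = rlhf_objective(reward, logprob_policy, logprob_ref, beta=0.5)

print(low_beta)   # 1.965 -> drift barely penalized, degenerate output still wins
print(high_beta)  # 0.25  -> larger beta sharply discounts the drifted output
```

With β = 0.01 the drifted output keeps almost all of its reward, so training keeps reinforcing it; raising β makes the same output score much worse, pulling the policy back toward the reference distribution. This is why increasing β is the natural first lever when outputs become repetitive and nonsensical despite high reward.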
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of Penalty Coefficient on LLM Fine-Tuning
Consequences of Modifying the PPO Objective Function