Multiple Choice

A language model is being trained using an objective function that balances a reward-based component with a penalty for deviating from an initial reference policy. The penalty's influence is controlled by a coefficient, β. During training, developers observe that the model's outputs, while achieving high reward scores, are becoming increasingly repetitive and nonsensical. Which of the following adjustments to β is the most appropriate first step to mitigate this issue, and why?
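The objective described in the stem is commonly written as reward minus β times a KL-style penalty against the reference policy. A minimal sketch of that trade-off, assuming this standard RLHF form (the function name and numbers are illustrative, not from the question):

```python
def rlhf_objective(reward, logprob_policy, logprob_ref, beta):
    # Reward minus a KL-style penalty that grows as the trained
    # policy drifts away from the initial reference policy.
    kl_penalty = logprob_policy - logprob_ref  # per-token KL estimate
    return reward - beta * kl_penalty

# Reward-hacking scenario: high reward, but large drift from the reference.
high_reward, policy_lp, ref_lp = 2.0, -1.0, -2.5  # drift = 1.5 nats

low_beta_obj = rlhf_objective(high_reward, policy_lp, ref_lp, beta=0.01)
high_beta_obj = rlhf_objective(high_reward, policy_lp, ref_lp, beta=0.5)

# With a larger beta, the same drifted output scores a lower objective,
# so optimization is pushed back toward the reference policy.
print(low_beta_obj, high_beta_obj)
```

Under this formulation, repetitive high-reward outputs signal that the reward term is dominating, which is why increasing β (strengthening the pull toward the reference policy) is the usual first adjustment.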

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models
