During the fine-tuning of a language model using a reward signal, a team observes that the model's outputs are becoming nonsensical, even though they receive high reward scores. The model is essentially 'gaming' the reward system. Which component in this training setup is specifically intended to mitigate this issue by penalizing the model for deviating too far from its initial, coherent language patterns?
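The component the question points to is a KL-divergence penalty against a frozen copy of the initial (reference) model: the reward signal is reduced in proportion to how far the fine-tuned policy's token distribution drifts from the reference. A minimal sketch of this idea, with all names and numbers hypothetical and the per-token KL approximated on the sampled sequence:

```python
import math

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.5):
    """Subtract a KL penalty from the raw reward.

    policy_logprobs / ref_logprobs: per-token log-probabilities that the
    policy being fine-tuned and the frozen reference model assign to the
    same sampled output (toy values here).
    beta: coefficient controlling the strength of the KL penalty.
    """
    # Monte Carlo KL estimate on the sampled tokens:
    # sum over tokens of log pi(token) - log pi_ref(token).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl

# Two outputs with the same raw reward of 1.0:
# one whose policy stays close to the reference, one that has drifted far.
close = kl_penalized_reward(1.0, [-1.0, -1.2], [-1.1, -1.3])
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-3.0, -3.5])
print(close, drifted)
```

The drifted policy pays a much larger penalty, so nonsensical outputs that merely game the reward model stop producing a high effective training signal.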
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Training Stagnation in a Reward-Based System
Evaluating Reference Model Selection in Reward-Based Training