Learn Before
An AI development team is refining a pre-trained language model using a dataset of human preferences, where each example consists of a prompt, a preferred response, and a rejected response. As training progresses, they notice that while the model is learning to generate responses that align with the preferences, its general language quality is deteriorating; it produces more repetitive and nonsensical text. What is the most probable cause of this issue related to the optimization objective's design?
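As context for reasoning about the question, one common preference-alignment formulation is the DPO objective, which keeps the policy anchored to a frozen reference model; the sketch below is a standard textbook form, not an answer key, and the symbols (policy π_θ, reference π_ref, strength β) are the usual conventions rather than anything specified in this card:

```latex
\mathcal{L}_{\text{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $y_w$ and $y_l$ are the preferred and rejected responses. The log-ratios against $\pi_{\text{ref}}$ implicitly penalize drifting far from the pre-trained model; an objective that maximizes preference margins without such a reference-model constraint can degrade general language quality, which is the failure mode the question describes.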
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Choosing a Baseline for Preference Alignment
Selecting a Baseline for Policy Optimization
Conceptual Objective Function Assumed in DPO