Learn Before
A team is training a large language model using a scoring function derived from human preference data. They observe that, beyond a certain point, continuing to optimize the model against this score decreases the actual quality of its responses as judged by human evaluators. What is the most fundamental reason for this phenomenon?
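The underlying cause is the overoptimization (reward hacking) problem named in the Related items below: the learned scoring function is only a proxy for true human preference, so a strong optimizer eventually exploits the proxy's errors instead of improving genuine quality (Goodhart's law). The toy Python sketch below is not from the source card; both reward functions are invented assumptions chosen so that the proxy tracks the true reward near the training data but carries a systematic error.

```python
# Toy illustration of reward-model overoptimization (Goodhart's law).
# Both reward functions are hypothetical, for illustration only.

def true_reward(theta):
    """Hypothetical 'actual quality' as judged by humans; best at theta = 1."""
    return -(theta - 1.0) ** 2

def proxy_reward(theta):
    """Hypothetical learned reward model: tracks the true reward near the
    preference data, plus a systematic error (+0.8 * theta) that a strong
    optimizer can exploit."""
    return true_reward(theta) + 0.8 * theta

theta, lr = 0.0, 0.05
for step in range(41):
    # Gradient ascent on the *proxy* score via central finite differences.
    grad = (proxy_reward(theta + 1e-4) - proxy_reward(theta - 1e-4)) / 2e-4
    theta += lr * grad
    if step % 4 == 0:
        print(f"step {step:2d}  proxy={proxy_reward(theta):+.3f}  "
              f"true={true_reward(theta):+.3f}")
```

In this sketch the proxy score climbs monotonically, while the true score peaks (around step 12) and then falls: past that point the optimizer is no longer improving genuine quality but exploiting the reward model's error, which is exactly the divergence the question describes.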
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Overoptimization Problem in Reward Modeling (Reward Hacking or Reward Gaming)
Divergence in LLM Performance
The Paradox of Optimization in Reward Modeling