Learn Before
A team is fine-tuning a large language model (the 'active model') to improve its performance on a specific task. They use the original, pre-trained version of the model as a fixed baseline. During training, a penalty is applied to the active model whenever its output probabilities for generating the next piece of text diverge significantly from the baseline model's probabilities. What is the most likely reason for incorporating this penalty mechanism?
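The penalty described here is typically a KL-divergence term between the active model's next-token distribution and the frozen baseline's, scaled by a coefficient and subtracted from the task reward. The sketch below is a minimal, library-free illustration of that idea; the function names, the `beta` coefficient, and the toy distributions are assumptions for illustration, not part of the original question.

```python
import math

def kl_penalty(active_probs, baseline_probs):
    """Per-token KL divergence KL(active || baseline).

    Grows as the active model's next-token probabilities drift
    away from the frozen baseline's; zero when they match.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(active_probs, baseline_probs)
               if p > 0)

def penalized_reward(task_reward, active_probs, baseline_probs, beta=0.1):
    """Task reward minus a scaled divergence penalty (beta is a
    hypothetical tuning coefficient), discouraging the active model
    from straying too far from the pre-trained baseline."""
    return task_reward - beta * kl_penalty(active_probs, baseline_probs)

# If the active model matches the baseline, no penalty is applied;
# the more its distribution diverges, the larger the deduction.
same = penalized_reward(1.0, [0.5, 0.5], [0.5, 0.5])      # no divergence
drift = penalized_reward(1.0, [0.9, 0.1], [0.5, 0.5])     # diverged
```

Keeping this penalty in the training objective is what prevents the active model from exploiting the task signal at the cost of forgetting the broad capabilities of the pre-trained baseline.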
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Constrained vs. Unconstrained Model Training
Stabilizing Model Fine-Tuning