1Cademy - A team is training a language model to generate helpful cooking recipes. They use a single reward model that scores recipes based on the number of ingredients from a predefined healthy list. They observe that the model starts generating nonsensical recipes that are just long lists of these healthy ingredients, achieving very high reward scores but being completely useless for cooking. Which of the following approaches is the most robust solution to prevent the model from exploiting the reward system in this way?

Learn Before

Combining Multiple Reward Models to Mitigate Overoptimization

Multiple Choice

A team is training a language model to generate helpful cooking recipes. They use a single reward model that scores recipes based on the number of ingredients from a predefined 'healthy' list. They observe that the model starts generating nonsensical recipes that are just long lists of these healthy ingredients, achieving very high reward scores but being completely useless for cooking. Which of the following approaches is the most robust solution to prevent the model from exploiting the reward system in this way?

Updated 2025-09-28

Contributors are:

Who are from:

Learn Before

Related