A development team is training a language model with two separate reward models: one that rewards helpfulness (RM-H) and another that rewards safety (RM-S). The two objectives often conflict. Rather than collapsing the two scores into a single combined reward, the team trains the policy to optimize both objectives simultaneously, treating each as a distinct goal. Which of the following outcomes is the most direct and characteristic result of this training approach?
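The contrast the question turns on can be sketched in a few lines. The toy below is a minimal, self-contained illustration, not the team's actual method: two quadratic functions stand in for RM-H and RM-S, and an equal-weight sum of normalized gradients is a crude stand-in for MGDA-style Pareto ascent. A scalarized approach would instead fix weights up front and maximize the single number w_h * r_h + w_s * r_s.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two reward models: the "policy" is a
# 2-D parameter vector, and each reward peaks in a different place, so
# the objectives conflict by construction.
def rm_h(theta):  # helpfulness peaks at (1, 0)
    return -np.sum((theta - np.array([1.0, 0.0])) ** 2)

def rm_s(theta):  # safety peaks at (0, 1)
    return -np.sum((theta - np.array([0.0, 1.0])) ** 2)

def grad(f, theta, eps=1e-5):
    # Finite-difference gradient; enough for a toy illustration.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = rng.normal(size=2)
for _ in range(200):
    g_h, g_s = grad(rm_h, theta), grad(rm_s, theta)
    # Multi-objective step: each reward contributes its own gradient
    # signal, kept as a distinct goal, rather than being collapsed
    # into one scalar reward before the update.
    step = g_h / (np.linalg.norm(g_h) + 1e-8) + g_s / (np.linalg.norm(g_s) + 1e-8)
    theta += 0.01 * step

print(theta)

Run long enough, theta settles on the segment between the two optima, a Pareto-optimal trade-off between helpfulness and safety, rather than at the single point a fixed weighting would select. That behavior is the characteristic signature of optimizing the objectives as distinct goals.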
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Optimizing a Chatbot for Competing Goals
Comparing Reward Optimization Strategies