A development team is training a language model with two separate reward models: one that rewards helpfulness (RM-H) and another that rewards safety (RM-S). The two objectives often conflict. Rather than collapsing the two scores into a single combined reward, the team trains the policy to optimize both objectives simultaneously, treating each as a distinct goal. Which of the following outcomes is the most direct and characteristic result of this training approach?
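The contrast the question turns on can be sketched in a few lines. The toy below is a minimal, self-contained illustration, not the team's actual method: two quadratic functions stand in for RM-H and RM-S, and an equal-weight sum of normalized gradients is a crude stand-in for MGDA-style Pareto ascent. A scalarized approach would instead fix weights up front and maximize the single number w_h * r_h + w_s * r_s.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two reward models: the "policy" is a
# 2-D parameter vector, and each reward peaks in a different place, so
# the objectives conflict by construction.
def rm_h(theta):  # helpfulness peaks at (1, 0)
    return -np.sum((theta - np.array([1.0, 0.0])) ** 2)

def rm_s(theta):  # safety peaks at (0, 1)
    return -np.sum((theta - np.array([0.0, 1.0])) ** 2)

def grad(f, theta, eps=1e-5):
    # Finite-difference gradient; enough for a toy illustration.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = rng.normal(size=2)
for _ in range(200):
    g_h, g_s = grad(rm_h, theta), grad(rm_s, theta)
    # Multi-objective step: each reward contributes its own gradient
    # signal, kept as a distinct goal, rather than being collapsed
    # into one scalar reward before the update.
    step = g_h / (np.linalg.norm(g_h) + 1e-8) + g_s / (np.linalg.norm(g_s) + 1e-8)
    theta += 0.01 * step

print(theta)

Run long enough, theta settles on the segment between the two optima, a Pareto-optimal trade-off between helpfulness and safety, rather than at the single point a fixed weighting would select. That behavior is the characteristic signature of optimizing the objectives as distinct goals.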
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Optimizing a Chatbot for Competing Goals
Comparing Reward Optimization Strategies