1Cademy - Multi-Objective Optimization for Policy Training with Multiple Reward Models

Learn Before

Combining Multiple Reward Models to Mitigate Overoptimization

Concept

Multi-Objective Optimization for Policy Training with Multiple Reward Models

Instead of combining multiple reward models into a single reward signal, an alternative is to treat policy training as a multi-objective optimization problem. Under this framework, each reward model defines its own objective, and the policy is trained to optimize all of these objectives simultaneously rather than a single aggregated score.
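The page does not name a specific algorithm, so the following is only a minimal sketch of what keeping objectives separate can look like. It assumes each candidate policy has already been scored by two reward models, and it uses Pareto dominance, a standard multi-objective notion, to find the candidates that no other candidate beats on every objective. The function names and the score values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher reward is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical mean rewards (reward model 1, reward model 2)
# for four candidate policies. These numbers are made up.
candidate_scores = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]

front = pareto_front(candidate_scores)
print(front)  # [0, 1, 3]
```

The third candidate is dropped because the second is better on both objectives. The remaining three each represent a different trade-off, and none can be ranked above the others without introducing a weighting between the reward models, which is exactly the aggregation step this approach avoids.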

Updated 2026-05-03

References

Reference of Foundations of Large Language Models Course

Learn After

A development team is training a language model using two separate reward models: one that rewards helpfulness (RM-H) and another that rewards safety (RM-S). These two objectives are often in conflict. Instead of creating a single, combined reward score, the team decides to train the policy to optimize for both objectives simultaneously as distinct goals. Which of the following outcomes is the most direct and characteristic result of this specific training approach?
Optimizing a Chatbot for Competing Goals
Comparing Reward Optimization Strategies
