Learn Before
Using Combined Reward for Policy Supervision
The aggregated reward score, computed by combining the outputs of multiple reward models, serves as the primary feedback signal that guides and supervises the training of the policy model.
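A minimal sketch of this aggregation, assuming a weighted-average combination rule; the function name, reward-model names, and values below are illustrative, not drawn from the course material:

```python
def combine_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate per-model reward scores into one scalar training signal.

    Weighted average: each model's score is scaled by its weight, summed,
    and normalized by the total weight.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * scores[name] for name in scores) / total_weight

# Illustrative values only: two hypothetical reward models.
combined = combine_rewards(
    scores={"helpfulness": 7.0, "harmlessness": 9.0},
    weights={"helpfulness": 1.0, "harmlessness": 2.0},
)  # (1*7.0 + 2*9.0) / 3.0 ≈ 8.33
```

This single scalar is what a policy-gradient update (e.g., during RLHF fine-tuning) would maximize in place of any one reward model's output.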
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Using Combined Reward for Policy Supervision
An AI alignment team is evaluating a language model's response using three distinct reward models: Helpfulness, Harmlessness, and Conciseness. For a specific response, the models provide the following scores and are assigned the following weights:
- Helpfulness: Score = 8.0, Weight = 2.0
- Harmlessness: Score = 9.0, Weight = 3.0
- Conciseness: Score = 6.0, Weight = 1.0
Using the weighted average formula for combining rewards, what is the final aggregated reward score for this response? (Assume K is the total number of models).
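For reference, a worked application of the weighted average, assuming the standard rule $R = \frac{\sum_{k=1}^{K} w_k r_k}{\sum_{k=1}^{K} w_k}$ that the question appears to reference:

$$
R = \frac{(2.0)(8.0) + (3.0)(9.0) + (1.0)(6.0)}{2.0 + 3.0 + 1.0} = \frac{16 + 27 + 6}{6} = \frac{49}{6} \approx 8.17
$$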
Adjusting Chatbot Behavior via Reward Model Weighting
Component Analysis of the Combined Reward Formula
Learn After
Diagnosing Undesirable Model Behavior
An AI development team is training a policy model for a chatbot using a combined reward signal. This signal is a weighted average of scores from two reward models: one for 'Helpfulness' (scoring accuracy and completeness) and one for 'Harmlessness' (scoring safety and ethical considerations). The team observes that the resulting chatbot is overly cautious, frequently refusing to answer benign questions by stating it cannot help. Which of the following is the most effective and direct adjustment to the training process to correct this specific behavior?
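A hedged sketch of the kind of adjustment this question points toward: shifting relative weight between the two reward models changes which behaviors the combined signal favors. The scores and weights below are hypothetical, chosen only to illustrate the effect:

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of reward-model scores, as in the combined reward above."""
    return sum(weights[k] * scores[k] for k in scores) / sum(weights.values())

# A refusal scores high on harmlessness but low on helpfulness;
# a substantive answer to a benign question does the reverse.
refusal = {"helpfulness": 2.0, "harmlessness": 9.5}
answer = {"helpfulness": 8.5, "harmlessness": 7.0}

# Under a harmlessness-heavy weighting, refusals are over-rewarded...
w_cautious = {"helpfulness": 1.0, "harmlessness": 3.0}
print(weighted_average(refusal, w_cautious))  # 7.625 -- refusal wins
print(weighted_average(answer, w_cautious))   # 7.375

# ...while shifting weight toward helpfulness favors substantive answers.
w_rebalanced = {"helpfulness": 2.0, "harmlessness": 1.0}
print(weighted_average(refusal, w_rebalanced))  # 4.5
print(weighted_average(answer, w_rebalanced))   # 8.0 -- answer wins
```

Adjusting the weights is the direct lever here: it changes the reward the policy is trained to maximize without retraining the reward models themselves.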