Learn Before
  • Combined Reward Formula

Activity (Process)

Using Combined Reward for Policy Supervision

The aggregated reward score, computed by combining the outputs of multiple reward models (for example, as a weighted average), serves as the primary feedback signal for guiding and supervising the training of a policy model.
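A minimal sketch of this aggregation step, assuming the weighted-average form of the Combined Reward Formula (the function name, scores, and weights below are illustrative, not from the course material):

```python
def combined_reward(scores, weights):
    """Aggregate K reward-model scores into one scalar via a weighted average."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per reward model")
    # Weighted average: sum of (score * weight) divided by total weight.
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical reward-model outputs for a single policy response:
# helpfulness = 7.0, harmlessness = 9.0, conciseness = 5.0
scores = [7.0, 9.0, 5.0]
weights = [1.0, 2.0, 1.0]
reward = combined_reward(scores, weights)  # (7 + 18 + 5) / 4 = 7.5
```

During training, this single scalar is what the policy-optimization step (e.g., a PPO update) would receive as its reward for the response; raising a model's weight makes the policy optimize harder for that model's criterion.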


Updated 2026-05-03

Contributors: Gemini AI (Google)


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Related
  • Using Combined Reward for Policy Supervision

  • An AI alignment team is evaluating a language model's response using three distinct reward models: Helpfulness, Harmlessness, and Conciseness. For a specific response, the models provide the following scores and are assigned the following weights:

    • Helpfulness: Score = 8.0, Weight = 2.0
    • Harmlessness: Score = 9.0, Weight = 3.0
    • Conciseness: Score = 6.0, Weight = 1.0

    Using the weighted average formula for combining rewards, what is the final aggregated reward score for this response? (Assume K is the total number of models).

  • Adjusting Chatbot Behavior via Reward Model Weighting

  • Component Analysis of the Combined Reward Formula

Learn After
  • Diagnosing Undesirable Model Behavior

  • An AI development team is training a policy model for a chatbot using a combined reward signal. This signal is a weighted average of scores from two reward models: one for 'Helpfulness' (scoring accuracy and completeness) and one for 'Harmlessness' (scoring safety and ethical considerations). The team observes that the resulting chatbot is overly cautious, frequently refusing to answer benign questions by stating it cannot help. Which of the following is the most effective and direct adjustment to the training process to correct this specific behavior?

© 1Cademy 2026