Combined Reward Formula
The combined reward, $\bar{r}$, is calculated by taking a weighted average of the outputs from different reward models. Each individual reward model's output, $r_k$, is multiplied by a weight $w_k$. These products are summed over all $K$ models, and the result is normalized by dividing by $K$. The formula is expressed as:

$$\bar{r} = \frac{1}{K} \sum_{k=1}^{K} w_k \cdot r_k$$
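As a minimal sketch of this formula (assuming, as stated above, that the weighted sum is normalized by the number of models $K$ rather than by the sum of the weights):

```python
def combined_reward(scores, weights):
    """Weighted combination of K reward-model scores, normalized by K."""
    assert len(scores) == len(weights), "one weight per reward model"
    K = len(scores)
    return sum(w * r for w, r in zip(weights, scores)) / K

# Example: three reward models with equal weights of 1.0
# reduces to a simple average of the scores.
print(combined_reward([8.0, 9.0, 7.0], [1.0, 1.0, 1.0]))  # 8.0
```

With all weights equal to 1.0, the formula collapses to the plain arithmetic mean, which is the special case used in the unweighted questions below.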

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined Reward Formula
An AI development team is using an ensemble of three separate models to evaluate a single generated response. The first model gives the response a score of 8.0, the second model gives it a score of 9.0, and the third model gives it a score of 7.0. To create a more robust and stable final evaluation, the team decides to use a simple averaging method. What is the final combined score for the response?
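The simple averaging method this question describes can be checked with a one-line computation (scores taken from the question text):

```python
# Unweighted mean of the three reward-model scores
scores = [8.0, 9.0, 7.0]
average = sum(scores) / len(scores)
print(average)  # 8.0
```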
An AI development team is using three specialized reward models to evaluate generated text: one for general helpfulness, one for factual accuracy, and one for safety. They combine the outputs of these models by taking a simple, unweighted average to produce a single final score. What is the most significant potential drawback of this specific approach?
Evaluating a Chatbot's Response Score
Learn After
Using Combined Reward for Policy Supervision
An AI alignment team is evaluating a language model's response using three distinct reward models: Helpfulness, Harmlessness, and Conciseness. For a specific response, the models provide the following scores and are assigned the following weights:
- Helpfulness: Score = 8.0, Weight = 2.0
- Harmlessness: Score = 9.0, Weight = 3.0
- Conciseness: Score = 6.0, Weight = 1.0
Using the weighted average formula for combining rewards, what is the final aggregated reward score for this response? (Assume K is the total number of models).
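A sketch of the computation, assuming (per the question's note) that the weighted sum is divided by $K$, the number of models, rather than by the sum of the weights:

```python
# Scores and weights from the question:
# Helpfulness, Harmlessness, Conciseness
scores = [8.0, 9.0, 6.0]
weights = [2.0, 3.0, 1.0]
K = len(scores)  # K = 3 models

# Weighted sum: 2*8 + 3*9 + 1*6 = 49, then normalize by K
reward = sum(w * r for w, r in zip(weights, scores)) / K
print(round(reward, 2))  # 16.33
```

Note that dividing by $K$ (rather than by the sum of the weights, 6.0) can yield a value outside the individual models' score range, as it does here.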
Adjusting Chatbot Behavior via Reward Model Weighting
Component Analysis of the Combined Reward Formula