Learn Before
Application of Segment-Based Total Reward in Policy Training
The total reward score, which is aggregated from the scores of the individual segments of a generated output, serves as the primary reward signal in the standard training process for the policy model.
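As a rough illustration of the aggregation step, the sketch below assumes the total reward is a plain sum of per-segment scores, as in the summary example listed under Learn After; other aggregation rules are possible. The function name `total_reward` is hypothetical, and the scores are the ones from the three-segment example under Related.

```python
from typing import Sequence


def total_reward(segment_scores: Sequence[float]) -> float:
    """Aggregate per-segment reward scores into one scalar reward.

    Assumption: aggregation is a simple unweighted sum.
    """
    return float(sum(segment_scores))


if __name__ == "__main__":
    # Scores assigned by the reward model to segments 1, 2 and 3.
    scores = [0.8, -0.3, 0.5]
    # This scalar is what would be passed to the policy update as the reward.
    print(total_reward(scores))
```

With these scores the total comes to approximately 1.0, so the negative score of the second segment is offset by the other two. A single scalar like this does not tell the policy which segment was weak.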
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Related
Application of Segment-Based Total Reward in Policy Training
A language model generates a three-segment response to a user's prompt. A separate reward model evaluates each segment, considering the full context of the prompt and the complete response, and assigns the following scores: Segment 1: 0.8, Segment 2: -0.3, Segment 3: 0.5. According to the principle of aggregating segment-based scores, what is the total reward for the entire generated response?
Analyzing Reward Model Behavior
Calculating a Missing Segment Score
Learn After
A policy model is being trained to generate summaries. Each generated summary is broken down into three sequential segments: beginning, middle, and end. A reward score is calculated for each segment, and the total reward for the summary is the simple sum of these three scores. This total reward is then used to update the model. During testing, it is observed that the model consistently generates summaries with a strong beginning but a weak, often incoherent, end. Which of the following adjustments to the training process would be most effective at specifically addressing this issue?
Analysis of Aggregated Reward Signals in Model Training
Overoptimization Problem in Reward Modeling
Goodhart's Law in Reward Modeling