1Cademy - Application of Segment-Based Total Reward in Policy Training

Learn Before

Total Reward as Sum of Segment-Based Scores

Activity (Process)

Application of Segment-Based Total Reward in Policy Training

The total reward score, $r(\mathbf{x}, \mathbf{y})$, is aggregated from the scores of the individual segments of a generated output. In the standard training process, this aggregated score serves as the primary reward signal for the policy model.
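The aggregation and its use as a training signal can be illustrated with a minimal sketch. It is not the course's implementation: the scorer `toy_scorer`, the helper names, and the REINFORCE-style surrogate loss are all assumptions chosen for brevity, and a real setup would use a learned reward model and typically a more elaborate policy-gradient method.

```python
from typing import Callable, Sequence


def total_reward(segments: Sequence[str],
                 score_segment: Callable[[str], float]) -> float:
    """Aggregate per-segment scores into one scalar reward r(x, y) by summing."""
    return sum(score_segment(seg) for seg in segments)


def policy_gradient_loss(log_prob: float, reward: float,
                         baseline: float = 0.0) -> float:
    """REINFORCE-style surrogate loss for one sample: -(r - b) * log pi(y | x).

    This is an assumed, simplified stand-in for the policy update.
    """
    return -(reward - baseline) * log_prob


def toy_scorer(segment: str) -> float:
    """Hypothetical segment scorer: word count, capped at 5."""
    return float(min(len(segment.split()), 5))


# A generated output split into three segments (illustrative data only).
segments = ["The report covers Q3 revenue.", "Sales rose.", "Outlook"]

r = total_reward(segments, toy_scorer)                # 5.0 + 2.0 + 1.0
loss = policy_gradient_loss(log_prob=-2.0, reward=r)  # -(8.0) * (-2.0)
print(r, loss)
```

Because the policy only sees the sum, a high score on one segment can mask a low score on another; the per-segment breakdown is lost once the scores are aggregated.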

Updated 2025-10-07

References

Reference of Foundations of Large Language Models Course

Learn After

A policy model is being trained to generate summaries. Each generated summary is broken down into three sequential segments: beginning, middle, and end. A reward score is calculated for each segment, and the total reward for the summary is the simple sum of these three scores. This total reward is then used to update the model. During testing, it is observed that the model consistently generates summaries with a strong beginning but a weak, often incoherent, end. Which of the following adjustmen
Analysis of Aggregated Reward Signals in Model Training
Overoptimization Problem in Reward Modeling
Goodhart's Law in Reward Modeling
