Analysis of Aggregated Reward Signals in Model Training
A team is training a language model to generate multi-paragraph stories. Their training process involves:
1) breaking each generated story into individual paragraphs (segments);
2) having a separate system score each paragraph for quality;
3) summing these individual paragraph scores into a single 'total quality score' for the entire story; and
4) using only this single total score as the feedback signal to update the model.
Analyze the primary limitation of this training approach with respect to how the model attributes credit or blame for its performance.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A policy model is being trained to generate summaries. Each generated summary is broken down into three sequential segments: beginning, middle, and end. A reward score is calculated for each segment, and the total reward for the summary is the simple sum of these three scores. This total reward is then used to update the model. During testing, it is observed that the model consistently generates summaries with a strong beginning but a weak, often incoherent end. Which of the following adjustments to the training process would be most effective at specifically addressing this issue? (See the per-segment sketch after this list.)
Overoptimization Problem in Reward Modeling
Goodhart's Law in Reward Modeling