Learn Before
A team is training a text-generation model where a single quality score is assigned only after a complete multi-sentence response is generated. For all intermediate steps (i.e., before the final word), a default score of 0 is used. The team notices the model struggles to maintain a consistent narrative thread throughout its responses. Which statement best analyzes the relationship between this scoring method and the model's behavior?
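The scoring scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `sparse_rewards` and `returns_to_go`, the example score of 0.8, and the undiscounted setting (`gamma = 1.0`) are hypothetical and do not come from the course material.

```python
from typing import List


def sparse_rewards(num_tokens: int, final_score: float) -> List[float]:
    """Per-step rewards: 0 for every intermediate step, the score at the last step."""
    if num_tokens <= 0:
        return []
    return [0.0] * (num_tokens - 1) + [final_score]


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards from each step onward (gamma is an assumption)."""
    out: List[float] = []
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# Hypothetical 5-token response scored 0.8 by the reward model.
rewards = sparse_rewards(5, 0.8)
print(rewards)
print(returns_to_go(rewards))
```

With no discounting, every step ends up with the same return, so the signal alone does not indicate which intermediate tokens helped or hurt the final score.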
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being trained using a reward model that provides a single quality score for a complete, generated response. If the model generates the four-token sequence ['The', 'cat', 'sat', '.'], which of the following reward lists best represents the standard sparse reward assignment for this process, where r_t is the reward at step t?
Evaluating a Sparse Reward Strategy