Learn Before
End-of-Sequence Reward Assignment in RLHF
When applying an RLHF reward model, the reward signal is typically sparse: a reward score is produced only at the final position of the output sequence y = y_1 ... y_n. For every intermediate position t < n, the reward defaults to a placeholder value such as 0, so r_t = 0 for t < n and r_n = R(x, y), where R(x, y) is the reward model's score for prompt x and completed response y. The meaningful reward is therefore computed only once the entire sequence has been generated.
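As a concrete illustration, here is a minimal Python sketch of this assignment scheme. The function name assign_sparse_rewards and the stand-in scorer are assumptions for illustration; a real pipeline would call a trained reward model in place of the dummy function.

```python
from typing import Callable, List

def assign_sparse_rewards(
    prompt: str,
    response_tokens: List[str],
    reward_model_score: Callable[[str, str], float],
) -> List[float]:
    """Return one reward per generated token: 0 at every step except the last.

    The reward model scores the (prompt, response) pair exactly once,
    after the sequence is complete; all intermediate steps keep the default 0.
    """
    n = len(response_tokens)
    rewards = [0.0] * n                      # r_t = 0 for all t < n
    if n > 0:
        full_response = "".join(response_tokens)
        rewards[-1] = reward_model_score(prompt, full_response)  # r_n = R(x, y)
    return rewards

# Usage with a dummy scorer returning a fixed, hypothetical score:
dummy_score = lambda prompt, response: 0.87
print(assign_sparse_rewards("Tell me a story.",
                            ["Once", " upon", " a", " time", "."],
                            dummy_score))
# -> [0.0, 0.0, 0.0, 0.0, 0.87]
```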
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Score Formula for LLM-based Reward Models
In a system designed to evaluate the quality of generated text, a complex neural network first processes a prompt and its corresponding response, ultimately producing a high-dimensional vector that captures the nuanced meaning and relationship between them. What is the essential final step required to convert this vector into a practical, usable evaluation, and what is the nature of its output? (A minimal sketch of this step appears after this list.)
Troubleshooting a Reward Model's Output
From Representation to Reward
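The "From Representation to Reward" item above points at a simple mechanism: a learned linear head maps the high-dimensional hidden vector to a single, unbounded scalar, which serves as the reward score. Below is a minimal PyTorch-style sketch under that assumption; the class name RewardHead and the hidden size 4096 are hypothetical, not taken from any specific model.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Map the final hidden-state vector for (prompt, response) to one scalar."""

    def __init__(self, hidden_size: int = 4096):     # hypothetical width
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)  # high-dim vector -> one real number

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, hidden_size) representation of the pair
        return self.value_head(last_hidden_state).squeeze(-1)  # (batch,) scalar scores

# Usage: a random vector stands in for the backbone network's representation.
head = RewardHead()
h = torch.randn(2, 4096)
print(head(h).shape)  # torch.Size([2]) -- one real-valued score per example
```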
Learn After
A team is training a text-generation model where a single quality score is assigned only after a complete multi-sentence response is generated. For all intermediate steps (i.e., before the final word), a default score of 0 is used. The team notices the model struggles to maintain a consistent narrative thread throughout its responses. Which statement best analyzes the relationship between this scoring method and the model's behavior?
A language model is being trained using a reward model that provides a single quality score for a complete, generated response. If the model generates the four-token sequence ['The', 'cat', 'sat', '.'], which of the following reward lists best represents the standard sparse reward assignment for this process, where r_t is the reward at step t?
Evaluating a Sparse Reward Strategy
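Under the standard sparse scheme described at the top of this card, a four-token response receives 0 at every step except the last, where the single sequence-level score lands. The snippet below makes this concrete; the final score 0.9 is a made-up placeholder for the reward model's output.

```python
tokens = ["The", "cat", "sat", "."]
final_score = 0.9   # hypothetical reward-model output R(x, y)

# r_t = 0 for t < n; only the last token carries the sequence-level score.
rewards = [0.0] * (len(tokens) - 1) + [final_score]
print(rewards)  # [0.0, 0.0, 0.0, 0.9]
```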