Concept

End-of-Sequence Reward Assignment in RLHF

When applying an RLHF reward model, the reward signal is typically sparse: a reward score is produced only at the final position of the output sequence y = y_1 ... y_n. Every intermediate position t < n is assigned a default value, usually 0. The actual, meaningful reward is computed only once the entire sequence is complete.
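This sparse assignment can be sketched in a few lines. The snippet below is a minimal illustration, not any library's API: `final_reward` stands in for the scalar a reward model would emit for the completed sequence, and every earlier token position simply receives 0.

```python
def assign_rewards(seq_len: int, final_reward: float) -> list[float]:
    """Sparse end-of-sequence reward assignment.

    Positions 1..n-1 get the default value 0; only the final
    position n carries the reward-model score.
    """
    if seq_len < 1:
        raise ValueError("sequence must contain at least one token")
    rewards = [0.0] * seq_len
    rewards[-1] = final_reward  # reward computed only at sequence end
    return rewards

# A 5-token output whose completed sequence scored 0.87:
print(assign_rewards(5, 0.87))  # [0.0, 0.0, 0.0, 0.0, 0.87]
```

In practice, RL algorithms such as PPO then spread credit for this terminal reward back to earlier tokens through value estimates and discounted returns; the vector above is just the raw signal they start from.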

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences