Concept

End-of-Sequence Reward Assignment in RLHF

When applying an RLHF reward model, the reward signal is typically sparse: a reward score is produced only at the final position of the output sequence y = y_1 ... y_n. Every intermediate position t < n is assigned a default value, usually 0. The actual, meaningful reward is computed only once the entire sequence is complete.
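This sparse assignment can be sketched in a few lines. The snippet below is a minimal illustration, not any library's API: `final_reward` stands in for the scalar a reward model would emit for the completed sequence, and every earlier token position simply receives 0.

```python
def assign_rewards(seq_len: int, final_reward: float) -> list[float]:
    """Sparse end-of-sequence reward assignment.

    Positions 1..n-1 get the default value 0; only the final
    position n carries the reward-model score.
    """
    if seq_len < 1:
        raise ValueError("sequence must contain at least one token")
    rewards = [0.0] * seq_len
    rewards[-1] = final_reward  # reward computed only at sequence end
    return rewards

# A 5-token output whose completed sequence scored 0.87:
print(assign_rewards(5, 0.87))  # [0.0, 0.0, 0.0, 0.0, 0.87]
```

In practice, RL algorithms such as PPO then spread credit for this terminal reward back to earlier tokens through value estimates and discounted returns; the vector above is just the raw signal they start from.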

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences