Concept

Rationale for End-of-Sequence Rewards in RLHF

The adoption of end-of-sequence rewards in RLHF is a strategic choice rooted in the nature of its tasks, which involve complex linguistic and cognitive processes rather than step-by-step interaction with a dynamic environment. In such settings, evaluating individual actions (tokens) is difficult, because the quality of any one token can only be judged within the full scope of the completed sequence; whether a generated answer is helpful or coherent, for instance, can only be assessed once the whole response has been written. This makes frequent, meaningful intermediate rewards impractical. Instead, RLHF relies on a single, sparse reward delivered once the sequence is complete. Although infrequent, this signal, derived from human preference judgments and typically produced by a learned reward model, is highly informative and accurate, enabling a learning process that is both robust and efficient.
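To make the shape of this signal concrete, here is a minimal sketch in Python (using PyTorch). The function name `terminal_reward_vector` and the 0.85 score are hypothetical illustrations, not from the source; in practice the score would come from a reward model trained on human preference data. The sketch shows how a single sequence-level score is placed at the final token position, with every intermediate position left at zero.

```python
import torch

def terminal_reward_vector(num_tokens: int, sequence_score: float) -> torch.Tensor:
    """Return per-token rewards for a sampled completion.

    Every intermediate token receives zero reward; only the final token
    carries the sequence-level score, mirroring the sparse end-of-sequence
    signal described above.
    """
    rewards = torch.zeros(num_tokens)
    rewards[-1] = sequence_score  # the only non-zero entry: the terminal reward
    return rewards

# A 6-token completion scored 0.85 by a (hypothetical) reward model:
print(terminal_reward_vector(6, 0.85))
# tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.8500])
```

Policy-gradient methods such as PPO then propagate this single terminal value back to earlier tokens through the return and advantage computation, which is how intermediate actions receive credit despite the sparsity of the reward.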
