Learn Before
Effectiveness of Sparse but Informative Human Feedback in RLHF
Although the reward signals in RLHF are sparse, typically arriving only once per complete sequence, they are highly effective for training. Because the feedback comes from human judgment, it is highly informative and accurate. This combination of sparsity with high-quality signals makes the learning process both robust and efficient.
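The idea above can be sketched in code: a single scalar reward arrives only at the end of the sequence, and a policy-gradient update spreads that end-of-sequence signal back over every generated token. This is a minimal illustrative sketch, not a real RLHF implementation; the function names and the REINFORCE-style loss are simplifying assumptions.

```python
def sequence_returns(num_tokens, final_reward, gamma=1.0):
    """Per-token returns when the only reward arrives after the full sequence.

    The sparse end-of-sequence reward is propagated backward, so every
    token receives a (possibly discounted) learning signal.
    """
    returns = []
    g = final_reward
    for _ in reversed(range(num_tokens)):
        returns.append(g)
        g *= gamma  # earlier tokens see a discounted share of the final reward
    return list(reversed(returns))


def reinforce_loss(log_probs, final_reward, gamma=1.0):
    """REINFORCE-style loss: -sum_t log pi(a_t | s_t) * G_t.

    Even though the human gives one score per sequence, every token's
    log-probability is weighted by the return it contributed to.
    """
    returns = sequence_returns(len(log_probs), final_reward, gamma)
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

With gamma = 1.0, every token shares the full sequence-level score, which is why one informative human rating can still shape the whole generation.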
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Effectiveness of Sparse but Informative Human Feedback in RLHF
A team is training a language model to write compelling short stories. They decide to provide a single quality score (reward) only after an entire story is completed, rather than scoring each sentence as it is generated. Which of the following statements best analyzes the primary justification for this training strategy?
AI Training Feedback Strategy
When training a language model to write a multi-paragraph summary of a document, providing a reward after each correctly structured sentence is generated is a more practical and effective approach than providing a single, holistic reward based on the quality of the final, complete summary.
Learn After
A team is training a language model to generate helpful and coherent paragraphs. They are comparing two feedback strategies:
- Strategy A: An automated system provides a small reward after every 5 words are generated, based on whether those words match a predefined vocabulary list.
- Strategy B: A human expert reads the entire completed paragraph and provides a single, holistic quality score.
Based on principles of effective training for complex language tasks, which strategy is likely to produce a better model, and why?
Evaluating Feedback Mechanisms for AI Training
In the context of training a large language model to generate creative stories, increasing the frequency of automated reward signals (e.g., a reward for every grammatically correct sentence) is always more effective than providing a single, holistic quality rating from a human expert at the end of the entire story.
Optimizing Feedback for an Empathetic AI