In the context of training a large language model to generate creative stories, increasing the frequency of automated reward signals (e.g., a reward for every grammatically correct sentence) is always more effective than providing a single, holistic quality rating from a human expert at the end of the entire story.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is training a language model to generate helpful and coherent paragraphs. They are comparing two feedback strategies:
- Strategy A: An automated system provides a small reward after every 5 words are generated, based on whether those words match a predefined vocabulary list.
- Strategy B: A human expert reads the entire completed paragraph and provides a single, holistic quality score.
Based on principles of effective training for complex language tasks, which strategy is likely to produce a better model, and why?
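As a rough illustration of the trade-off the question is probing, here is a minimal Python sketch contrasting the two strategies. All names, the vocabulary list, and the chunk size are hypothetical, not taken from any real training pipeline.

```python
# Hypothetical sketch: dense automated proxy reward (Strategy A)
# vs. a single holistic expert score (Strategy B).

VOCABULARY = {"the", "model", "generates", "helpful", "coherent", "text"}

def strategy_a_rewards(tokens: list[str], chunk: int = 5) -> list[float]:
    """Dense, frequent proxy reward: +1 for each 5-word chunk whose words
    all appear in a predefined vocabulary list. Cheap and easy to optimize,
    but the model can score highly without producing a coherent paragraph
    (it optimizes the proxy, not the actual goal)."""
    rewards = []
    for i in range(0, len(tokens), chunk):
        window = tokens[i:i + chunk]
        rewards.append(1.0 if all(w.lower() in VOCABULARY for w in window) else 0.0)
    return rewards

def strategy_b_reward(paragraph: str, expert_score: float) -> list[float]:
    """Sparse, holistic reward: one human rating for the entire paragraph.
    Delayed and noisier (credit assignment is harder), but it measures the
    true objective: overall helpfulness and coherence."""
    return [expert_score]  # single terminal reward for the whole episode

if __name__ == "__main__":
    text = "the model generates helpful coherent text again and again and again"
    print(strategy_a_rewards(text.split()))  # dense per-chunk proxy signal
    print(strategy_b_reward(text, 0.6))      # single end-of-paragraph signal
```

The sketch makes the core tension concrete: Strategy A's frequent signal rewards a proxy that can be gamed, while Strategy B's single score is harder to learn from but is aligned with the quality being judged.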
Evaluating Feedback Mechanisms for AI Training
Optimizing Feedback for an Empathetic AI