1Cademy

Learn Before

End-of-Sequence Reward Assignment in RLHF

Multiple Choice

A language model is being trained using a reward model that provides a single quality score for a complete, generated response. If the model generates the four-token sequence ['The', 'cat', 'sat', '.'], which of the following reward lists best represents the standard sparse reward assignment for this process, where r_t is the reward at step t?
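
The setup described in the question can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the helper name `sparse_rewards` and the score value 0.8 are hypothetical. It assumes the usual sparse convention in which the reward model's single score is credited to the final token and every earlier step receives zero reward.

```python
from typing import List


def sparse_rewards(tokens: List[str], score: float) -> List[float]:
    """Return per-step rewards r_t for a generated sequence.

    Assumption: the reward model scores only the complete response,
    so r_t = 0 for every step except the last, which receives the score.
    """
    if not tokens:
        return []
    rewards = [0.0] * len(tokens)  # zero reward at every intermediate step
    rewards[-1] = score            # the whole-sequence score lands on the final token
    return rewards


tokens = ["The", "cat", "sat", "."]
rewards = sparse_rewards(tokens, score=0.8)  # 0.8 is a made-up reward model score
print(rewards)  # [0.0, 0.0, 0.0, 0.8]
```

With this convention, only the step that emits '.' carries a nonzero reward. Credit for the earlier tokens has to be propagated backward by the training algorithm, for example through discounted returns or an advantage estimate.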

Updated 2025-10-03
