Learn Before
Multiple Choice

In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?
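The contrast the question points at can be sketched in code. Below is a minimal toy illustration (all names and values are illustrative, not from this question): a per-token temporal-difference (TD) update bootstraps each position's value estimate from the next position's estimate as soon as a token's reward is available, while a Monte Carlo update must wait for the full discounted return at the end of the sequence.

```python
def td_update(values, rewards, gamma=0.99, lr=0.1):
    """Per-token update: after each token t, apply the TD(0) error
    delta_t = r_t + gamma * V(t+1) - V(t), bootstrapping from the
    next position's current estimate (0 past the sequence end)."""
    for t in range(len(rewards)):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        values[t] += lr * delta
    return values


def mc_update(values, rewards, gamma=0.99, lr=0.1):
    """End-of-sequence update: regress each position's value toward
    the full discounted return, which is only known once the whole
    sequence has been generated."""
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    for t in range(len(rewards)):
        values[t] += lr * (returns[t] - values[t])
    return values
```

With a sparse terminal reward (e.g. `rewards = [0, 0, 1]`), the TD pass moves only the last estimate on its first sweep while the Monte Carlo pass moves every position at once, making concrete the bias/variance and update-timing trade-off the question asks about.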

Updated 2025-10-07


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science