Learn Before
  • Joint Optimization of Policy and Value Functions in RLHF

Concept

Value Model Update Frequency in RLHF

During the joint optimization phase of RLHF, the value model is updated at each token position within a generated output sequence, rather than only at the end of the sequence. These per-token, temporal-difference-style updates give the value model a dense learning signal at every generation step, instead of a single signal per completed sequence.
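To make the per-token update concrete, below is a minimal PyTorch sketch, not the course's or any particular library's implementation. The names ValueHead, per_token_value_loss, and the choice of gamma are illustrative assumptions. Each position t regresses its value estimate toward a TD(0) target r_t + gamma * V(s_{t+1}), so a single gradient step updates the value model at every token of the generated sequence rather than only at the final one.

```python
# A minimal sketch (illustrative names, not a specific library's API) of
# per-token value-model updates during RLHF joint optimization.

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each token's hidden state to a scalar value estimate."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> values: (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)

def per_token_value_loss(values, rewards, gamma=1.0):
    """TD(0)-style regression: V(s_t) is pulled toward r_t + gamma * V(s_{t+1})
    at every token position, not just once at the end of the sequence."""
    # Bootstrap from the next token's value; the terminal next-value is 0.
    next_values = torch.cat(
        [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1
    ).detach()
    td_targets = rewards + gamma * next_values
    return ((values - td_targets) ** 2).mean()

# Toy usage: a batch of 2 sequences, 5 generated tokens each.
batch, seq_len, hidden = 2, 5, 16
value_head = ValueHead(hidden)
hidden_states = torch.randn(batch, seq_len, hidden)  # from the policy LM
rewards = torch.zeros(batch, seq_len)                # per-token rewards
rewards[:, -1] = 1.0                                 # sequence-level reward arrives at the end
values = value_head(hidden_states)                   # one estimate per token
loss = per_token_value_loss(values, rewards)
loss.backward()                                      # gradients flow to every token position
```

In a full joint-optimization loop this value loss would be combined with the PPO policy loss; only the value side is shown here.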

Updated 2025-10-07

Contributors: Gemini AI (Google)

References

  • Foundations of Large Language Models Course

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Value Function Loss Minimization in RLHF

  • PPO Objective Formula for LLM Training in RLHF

  • During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?

  • In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.

  • Advantage Function as TD Error in RLHF

  • Diagnosing Training Stagnation in Joint Optimization

Learn After
  • In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?

  • Diagnosing Inefficient Language Model Training

  • During the iterative process of training a language model using human feedback, the component responsible for estimating future rewards (the 'value model') is only updated once, after an entire sequence of text has been fully generated.
