1Cademy - During the iterative process of training a language model using human feedback, the component responsible for estimating future rewards (the value model) is only updated once, after an entire sequence of text has been fully generated.

Learn Before

Value Model Update Frequency in RLHF

True/False

During the iterative process of training a language model using human feedback, the component responsible for estimating future rewards (the 'value model') is only updated once, after an entire sequence of text has been fully generated.

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Comprehension in Revised Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Related

In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?

Diagnosing Inefficient Language Model Training
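
The items above contrast two update schedules for a value model: one update per generated token versus a single update after the whole sequence. A minimal sketch of that contrast follows, under loud assumptions: the value model is replaced by a toy table with one entry per token position, the per-token schedule is plain TD(0), and every name here (`td0_per_token_updates`, `single_end_of_sequence_update`, `lr`, `gamma`) is hypothetical and chosen only for illustration. It is not the training code of any particular RLHF system.

```python
from typing import List, Tuple


def td0_per_token_updates(
    values: List[float], rewards: List[float], lr: float = 0.5, gamma: float = 1.0
) -> Tuple[List[float], int]:
    """Apply one TD(0) update per token position.

    `values` is a toy table with len(rewards) + 1 entries; the last entry
    stands for the terminal state. This is an assumption for illustration,
    not a neural value model.
    """
    v = list(values)
    updates = 0
    for t, reward in enumerate(rewards):
        # Bootstrap from the current estimate of the next position.
        td_target = reward + gamma * v[t + 1]
        v[t] += lr * (td_target - v[t])
        updates += 1
    return v, updates


def single_end_of_sequence_update(
    values: List[float], rewards: List[float], lr: float = 0.5, gamma: float = 1.0
) -> Tuple[List[float], int]:
    """Apply a single update once the full sequence is available.

    Only the estimate for the first position is moved toward the
    discounted return of the whole sequence.
    """
    v = list(values)
    total_return = 0.0
    for reward in reversed(rewards):
        total_return = reward + gamma * total_return
    v[0] += lr * (total_return - v[0])
    return v, 1


if __name__ == "__main__":
    # Hypothetical three-token sequence with a reward only at the end.
    rewards = [0.0, 0.0, 1.0]
    initial = [0.0, 0.0, 0.0, 0.0]
    print(td0_per_token_updates(initial, rewards))
    print(single_end_of_sequence_update(initial, rewards))
```

In this toy setting the per-token schedule produces one learning signal for each position as tokens arrive, while the single-update schedule produces exactly one signal per sequence. Real systems typically batch these per-token losses, so the sketch should be read as an illustration of update granularity rather than of an actual optimizer loop.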