Learn Before
Multiple Choice

In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?
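The contrast the question points at can be sketched in code. Below is a minimal toy illustration (all names and values are illustrative, not from this question): a per-token temporal-difference (TD) update bootstraps each position's value estimate from the next position's estimate as soon as a token's reward is available, while a Monte Carlo update must wait for the full discounted return at the end of the sequence.

```python
def td_update(values, rewards, gamma=0.99, lr=0.1):
    """Per-token update: after each token t, apply the TD(0) error
    delta_t = r_t + gamma * V(t+1) - V(t), bootstrapping from the
    next position's current estimate (0 past the sequence end)."""
    for t in range(len(rewards)):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        values[t] += lr * delta
    return values


def mc_update(values, rewards, gamma=0.99, lr=0.1):
    """End-of-sequence update: regress each position's value toward
    the full discounted return, which is only known once the whole
    sequence has been generated."""
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    for t in range(len(rewards)):
        values[t] += lr * (returns[t] - values[t])
    return values
```

With a sparse terminal reward (e.g. `rewards = [0, 0, 1]`), the TD pass moves only the last estimate on its first sweep while the Monte Carlo pass moves every position at once, making concrete the bias/variance and update-timing trade-off the question asks about.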

Updated 2025-10-07


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science