Learn Before
Diagnosing Inefficient Language Model Training
Given the following case study of a language model training process, identify the most critical change to the value model's update schedule to improve training stability and efficiency, and justify your reasoning.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?
Diagnosing Inefficient Language Model Training
During the iterative process of training a language model with human feedback, the component responsible for estimating future rewards (the 'value model') is updated only once, after an entire sequence of text has been fully generated, rather than incrementally as each token is produced.
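The contrast in this case study can be sketched with a toy example. Below is a minimal, hypothetical illustration (not a real RLHF pipeline): a per-token schedule applies a small TD(0)-style correction to the value estimate after each token, bootstrapping from the next position's estimate, while an end-of-sequence schedule waits for the full Monte Carlo return and applies one update per position afterward. Function names, the learning rate, and the toy reward list are all assumptions for illustration.

```python
def td_per_token(rewards, gamma=1.0, lr=0.5):
    """Per-token schedule: update V[t] right after token t is generated,
    using the TD(0) target r_t + gamma * V[t+1] (bootstrapped estimate)."""
    n = len(rewards)
    V = [0.0] * (n + 1)          # V[n] is the terminal state, fixed at 0
    for t in range(n):
        target = rewards[t] + gamma * V[t + 1]
        V[t] += lr * (target - V[t])   # small correction per token
    return V[:n]


def mc_end_of_sequence(rewards, gamma=1.0, lr=0.5):
    """End-of-sequence schedule: wait until the whole sequence is generated,
    compute the observed return-to-go for each position, then update once."""
    n = len(rewards)
    returns = [0.0] * n
    g = 0.0
    for t in reversed(range(n)):
        g = rewards[t] + gamma * g     # discounted return of the suffix
        returns[t] = g
    V = [0.0] * n
    for t in range(n):
        V[t] += lr * (returns[t] - V[t])   # one large correction at the end
    return V
```

With a sparse terminal reward such as `[0, 0, 1]`, the per-token schedule only adjusts the final position on its first pass (credit propagates backward over subsequent sequences), whereas the end-of-sequence schedule moves every position at once but provides no feedback signal until generation completes, which is the stability/efficiency trade-off the case study asks about.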