Learn Before
A language model is being trained to generate text. At a given step, it considers generating the next token. The system has the following estimates:
- The value (expected discounted future reward) of the current state is V(s) = 1.2.
- The immediate reward received after generating a specific token is r = +0.5.
- The value of the new state reached after generating the token is V(s') = 1.0.
- The discount factor for future rewards is γ = 0.9.
Based on the standard temporal difference method for estimating the advantage, what is the advantage of taking this action, and what does it imply?
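A worked sketch of the computation, assuming the standard one-step temporal-difference advantage estimate (the TD error), where s is the current state, s' the state after generating the token, r the immediate reward, and γ the discount factor:

\[
\hat{A}(s, a) = r + \gamma V(s') - V(s) = 0.5 + 0.9 \times 1.0 - 1.2 = 0.2
\]

The positive advantage (0.2 > 0) implies that generating this token is better than the current value estimate of the state predicts, so a policy-gradient update in RLHF would increase the probability of generating this token.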
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Value Function Loss Minimization in RLHF
Policy Improvement Decision
Interpreting the Advantage Function