Learn Before
A language model is being trained to generate text. At a given step, it considers generating the next token. The system has the following estimates:
- The value (expected discounted future reward) of the current state is V(s) = 1.2.
- The immediate reward received after generating a specific token is r = +0.5.
- The value of the new state reached after generating the token is V(s') = 1.0.
- The discount factor for future rewards is γ = 0.9.
Based on the standard temporal difference method for estimating the advantage, what is the advantage of taking this action, and what does it imply?
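A worked sketch of the computation, assuming the standard one-step temporal-difference advantage estimate (the TD error), where s is the current state, s' the state after generating the token, r the immediate reward, and γ the discount factor:

\[
\hat{A}(s, a) = r + \gamma V(s') - V(s) = 0.5 + 0.9 \times 1.0 - 1.2 = 0.2
\]

The positive advantage (0.2 > 0) implies that generating this token is better than the current value estimate of the state predicts, so a policy-gradient update in RLHF would increase the probability of generating this token.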
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Value Function Loss Minimization in RLHF
Policy Improvement Decision
Interpreting the Advantage Function