Learn Before
Calculating the Advantage for a Single Token Generation
During the fine-tuning of a language model, at a specific step t, the model has generated the sequence y_{<t} based on an initial prompt x. The value function estimates the value of this state, V(x, y_{<t}), to be 0.5. The model then generates the next token, y_t, and receives an immediate reward r_t of 0.1 from a reward model. The value function's estimate for the new state, V(x, y_{<t+1}), is 0.8. Assuming a discount factor γ of 0.9, calculate the advantage A_t for this step. Show your calculation.
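A minimal sketch of the computation, assuming the standard one-step temporal-difference advantage estimate A_t = r_t + γ·V(x, y_{<t+1}) − V(x, y_{<t}); the function name and argument names below are illustrative, not from the source:

```python
def td_advantage(reward: float, value_current: float,
                 value_next: float, gamma: float) -> float:
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * value_next - value_current

# Values from the question above.
a_t = td_advantage(reward=0.1, value_current=0.5, value_next=0.8, gamma=0.9)
print(a_t)  # 0.1 + 0.9 * 0.8 - 0.5 = 0.32 (printed with float rounding noise)
```

The positive result (A_t = 0.32) indicates that generating y_t led to a better-than-expected outcome from the state y_{<t}.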
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
Value Function Loss Minimization in RLHF
Analyzing a Single Training Step in Language Model Fine-Tuning
Calculating the Advantage for a Single Token Generation
During the fine-tuning of a large language model, at a specific generation step t, the calculated advantage value is found to be significantly negative (A_t < 0). What is the most accurate interpretation of this outcome?