Short Answer

Calculating the Advantage for a Single Token Generation

During the fine-tuning of a language model, at step t the model has generated the prefix y_{<t} conditioned on an initial prompt x. The value function estimates the value of this state, V(x, y_{<t}), to be 0.5. The model then generates the next token, y_t, and receives an immediate reward r_t of 0.1 from a reward model. The value function's estimate for the new state, V(x, y_{<t+1}), is 0.8. Assuming a discount factor γ of 0.9, calculate the advantage A_t for this step. Show your calculation.
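Under the standard one-step temporal-difference (TD) definition of the advantage, A_t = r_t + γ·V(x, y_{<t+1}) − V(x, y_{<t}). A minimal sketch of the calculation with the numbers given in the question (variable names are illustrative, not from the source):

```python
# One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
r_t = 0.1      # immediate reward from the reward model
v_t = 0.5      # value estimate V(x, y_{<t})
v_next = 0.8   # value estimate V(x, y_{<t+1})
gamma = 0.9    # discount factor

advantage = r_t + gamma * v_next - v_t
print(round(advantage, 2))  # 0.32
```

Substituting the values: A_t = 0.1 + 0.9 × 0.8 − 0.5 = 0.1 + 0.72 − 0.5 = 0.32.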


Updated 2025-10-07


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course
