Multiple Choice

During a reinforcement learning update for a language model, the value function is trained to predict future rewards. At a given step, the value function's output for the current state is V_current = 3.0. The model then generates a token, for which a reward model assigns a score of r = 0.5, and the value function's output for the new state is V_next = 4.0. With a discount factor of γ = 0.9, the training objective is to minimize the squared difference between V_current and a target value. Based on these figures, what does the training objective imply about the initial prediction V_current?
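
For the arithmetic, here is a minimal sketch assuming the target is the standard one-step temporal-difference (TD) target r + γ · V_next (an assumption, since the question only says "a target value"; the variable names are illustrative):

```python
# Minimal TD(0) check, assuming the target is the standard one-step
# temporal-difference target r + gamma * V_next (an assumption; the
# question only says "a target value"). Variable names are illustrative.
v_current = 3.0  # value prediction for the current state
r = 0.5          # reward-model score for the generated token
v_next = 4.0     # value prediction for the new state
gamma = 0.9      # discount factor

target = r + gamma * v_next       # 0.5 + 0.9 * 4.0 = 4.1
loss = (v_current - target) ** 2  # (3.0 - 4.1)^2 = 1.21

print(f"target = {target:.2f}, squared error = {loss:.2f}")
# target (4.1) > v_current (3.0), so minimizing the squared error pushes
# v_current upward: the initial prediction underestimates the TD target.
```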


Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy
