Multiple Choice

During a reinforcement learning update for a language model, the value function is trained to predict future rewards. At a given step, the value function's output for the current state is V_current = 3.0. The model then generates a token, for which a reward model assigns a score of r = 0.5, and the value function's output for the new state is V_next = 4.0. With a discount factor of γ = 0.9, the training objective is to minimize the squared difference between V_current and a target value. Based on these figures, what does the training objective imply about the initial prediction V_current?
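
For the arithmetic, here is a minimal sketch assuming the target is the standard one-step temporal-difference (TD) target r + γ · V_next (an assumption, since the question only says "a target value"; the variable names are illustrative):

```python
# Minimal TD(0) check, assuming the target is the standard one-step
# temporal-difference target r + gamma * V_next (an assumption; the
# question only says "a target value"). Variable names are illustrative.
v_current = 3.0  # value prediction for the current state
r = 0.5          # reward-model score for the generated token
v_next = 4.0     # value prediction for the new state
gamma = 0.9      # discount factor

target = r + gamma * v_next       # 0.5 + 0.9 * 4.0 = 4.1
loss = (v_current - target) ** 2  # (3.0 - 4.1)^2 = 1.21

print(f"target = {target:.2f}, squared error = {loss:.2f}")
# target (4.1) > v_current (3.0), so minimizing the squared error pushes
# v_current upward: the initial prediction underestimates the TD target.
```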


Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy
