Learn Before
Debugging a Value Model's Loss Calculation
An engineer is training a model to predict the cumulative quality (value) of a generated text sequence, with the model updated at each step of the generation process. The engineer observes that the model's predictions are unstable and not improving. Upon reviewing the implementation, the engineer finds that the squared error for the value prediction at a given step t is being calculated as:
Error = (reward_t - V_predicted_t)^2
Where reward_t is the immediate quality score at step t and V_predicted_t is the model's value prediction for that step.
Based on the standard objective for training such a predictive value model, what critical term is missing from the calculation of the target value (i.e., what V_predicted_t is being compared against)? Explain why the absence of this term leads to the observed training instability.
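For concreteness, the buggy loss above can be contrasted with the standard temporal-difference (TD) loss for a value model. This is a minimal sketch in plain Python; the function names and the discount factor value are illustrative assumptions, not part of the original question:

```python
GAMMA = 0.99  # assumed discount factor hyperparameter


def buggy_value_loss(reward_t, v_pred_t):
    # As described in the question: the target is only the immediate
    # reward, so the model is pushed toward per-step rewards rather
    # than cumulative future quality.
    return (reward_t - v_pred_t) ** 2


def td_value_loss(reward_t, v_pred_t, v_pred_next):
    # Standard TD(0) target: immediate reward plus the discounted value
    # prediction for the NEXT step. In a real implementation the target
    # is typically treated as a constant (stop-gradient / detach).
    target = reward_t + GAMMA * v_pred_next
    return (target - v_pred_t) ** 2
```

The difference is the bootstrap term `GAMMA * v_pred_next`, which lets each step's target account for quality accrued after step t rather than at step t alone.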
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Debugging a Value Model's Loss Calculation
A value model is trained using a loss function that minimizes the squared difference between its current value prediction, V(s_t), and a target value calculated as the sum of the immediate reward and the discounted value of the next state, r_t + γ·V(s_{t+1}). Why is the squared difference used as the core of this loss function, rather than simply the absolute difference or another metric?
A value model is being trained to estimate the expected future reward from a given state. Its loss is calculated as the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state. During the backpropagation step to update the model's parameters, gradients are computed with respect to both the model's prediction for the current state and its prediction for the next state (which is part of the target).
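The scenario above can be made concrete by writing out the squared TD error and its gradients with respect to both value predictions by hand. This is an illustrative sketch (the names and discount factor are assumptions); it shows that, unless the target is explicitly detached, the loss produces a nonzero gradient through the next-state prediction as well:

```python
GAMMA = 0.99  # assumed discount factor


def td_loss_and_grads(reward_t, v_t, v_next):
    """Squared TD error and its analytic gradients.

    L           = (r_t + gamma * v_next - v_t)^2
    dL/dv_t     = -2 * (r_t + gamma * v_next - v_t)
    dL/dv_next  =  2 * gamma * (r_t + gamma * v_next - v_t)
    """
    delta = reward_t + GAMMA * v_next - v_t  # TD error
    loss = delta ** 2
    grad_v_t = -2.0 * delta
    # Nonzero because v_next appears inside the target; frameworks
    # usually stop-gradient the target to zero this term out.
    grad_v_next = 2.0 * GAMMA * delta
    return loss, grad_v_t, grad_v_next
```

The `grad_v_next` term is exactly the gradient the question refers to: backpropagating through the target couples the update for state t to the prediction at state t+1.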