Case Study

Debugging a Value Model's Loss Calculation

An engineer is training a model to predict the cumulative quality (value) of a generated text sequence. The model is updated at each step of the generation process. The engineer observes that the model's predictions are unstable and not improving. Upon reviewing the implementation, the engineer finds that the squared error for the value prediction at a given step t is being calculated as:

Error = (reward_t - V_predicted_t)^2

Where reward_t is the immediate quality score at step t and V_predicted_t is the model's value prediction for that step.
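To make the setup concrete, here is a minimal sketch of the loss computation as described above. The function name and the sample tensors are illustrative assumptions, not taken from the engineer's actual implementation.

```python
def buggy_value_loss(rewards, values):
    """Mean squared error using only the immediate reward as the target.

    rewards: immediate quality score reward_t at each step t
    values:  model's value prediction V_predicted_t at each step t
    """
    # The target here is reward_t alone -- this is the calculation
    # under review in the case study.
    return sum((r - v) ** 2 for r, v in zip(rewards, values)) / len(rewards)

# Hypothetical per-step rewards and predictions for a 3-step sequence.
rewards = [0.1, 0.0, 0.5]
values = [0.4, 0.2, 0.3]
loss = buggy_value_loss(rewards, values)
```

Note that each prediction is regressed toward only the single-step quality score, with no reference to what the model predicts for subsequent steps of the sequence.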

Based on the standard objective for training such a predictive value model, what critical term is missing from the calculation of the target value (i.e., what V_predicted_t is being compared against)? Explain why the absence of this term leads to the observed training instability.

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science