Case Study

Debugging a Value Model's Loss Calculation

An engineer is training a model to predict the cumulative quality (value) of a generated text sequence. The model is updated at each step of the generation process. The engineer observes that the model's predictions are unstable and not improving. Upon reviewing the implementation, the engineer finds that the squared error for the value prediction at a given step t is being calculated as:

Error = (reward_t - V_predicted_t)^2

Where reward_t is the immediate quality score at step t and V_predicted_t is the model's value prediction for that step.
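To make the setup concrete, here is a minimal sketch of the loss computation as described above. The function name and the sample tensors are illustrative assumptions, not taken from the engineer's actual implementation.

```python
def buggy_value_loss(rewards, values):
    """Mean squared error using only the immediate reward as the target.

    rewards: immediate quality score reward_t at each step t
    values:  model's value prediction V_predicted_t at each step t
    """
    # The target here is reward_t alone -- this is the calculation
    # under review in the case study.
    return sum((r - v) ** 2 for r, v in zip(rewards, values)) / len(rewards)

# Hypothetical per-step rewards and predictions for a 3-step sequence.
rewards = [0.1, 0.0, 0.5]
values = [0.4, 0.2, 0.3]
loss = buggy_value_loss(rewards, values)
```

Note that each prediction is regressed toward only the single-step quality score, with no reference to what the model predicts for subsequent steps of the sequence.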

Based on the standard objective for training such a predictive value model, what critical term is missing from the calculation of the target value (i.e., what V_predicted_t is being compared against)? Explain why the absence of this term leads to the observed training instability.

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science