True/False

A value model is being trained to estimate the expected future reward from a given state. Its loss is the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state (the temporal-difference target). During the backpropagation step that updates the model's parameters, gradients are propagated through both the model's prediction for the current state and its prediction for the next state (which is part of the target).

0

1
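The distinction the question hinges on can be sketched numerically. The sketch below uses a hypothetical linear value model V(s) = w·s (the weights, states, and reward values are illustrative, not from the source) and compares the standard semi-gradient TD update, which treats the target as a constant, against the full gradient that also differentiates through the next-state prediction:

```python
import numpy as np

# Hypothetical linear value model V(s) = w @ s; all numbers are illustrative.
w = np.array([0.5, -0.2])
s, s_next = np.array([1.0, 2.0]), np.array([0.0, 1.0])
r, gamma = 1.0, 0.9

v_s = w @ s                  # prediction for the current state
v_next = w @ s_next          # prediction for the next state
target = r + gamma * v_next  # TD target
delta = v_s - target         # TD error

# Semi-gradient (standard TD learning): the target is treated as a
# constant, so only the current-state prediction contributes.
grad_semi = 2 * delta * s

# Full gradient: differentiating through the target as well adds a
# -gamma * s_next term -- this is the setup the question describes.
grad_full = 2 * delta * (s - gamma * s_next)

print(grad_semi)  # gradient used in practice (target detached)
print(grad_full)  # gradient if the target is not detached
```

In deep-RL frameworks the semi-gradient is what `target.detach()` (PyTorch) or `stop_gradient` (JAX/TensorFlow) implements; the two gradients differ whenever the next state is nonzero.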

Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science