Learn Before
Value Function Loss in RLHF
The value model in RLHF, which estimates the expected future reward from a given state, is trained simultaneously with the policy model. Its training objective is to minimize the Mean Squared Error (MSE) between its predicted state value and a bootstrapped target built from the reward model's immediate reward and the discounted value of the next state. This is effectively minimizing the squared Temporal Difference (TD) error. The loss function is:

$$\mathcal{L}(\phi) = \left( r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \right)^2$$

where $V_\phi$ is the value function with parameters $\phi$, $r_t$ is the reward at step $t$, $\gamma$ is the discount factor, and the target $r_t + \gamma V_\phi(s_{t+1})$ is considered a fixed value during the gradient calculation for this loss.
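As a concrete illustration, here is a minimal PyTorch-style sketch of this loss. The names (`value_model`, `states`, `rewards`, and so on) are hypothetical, not from the source; `torch.no_grad()` implements the "target is fixed" convention by blocking gradients through the next-state prediction.

```python
import torch

def value_loss(value_model, states, next_states, rewards, gamma=0.99):
    """Squared TD-error loss for the value model (a minimal sketch).

    `value_model` maps a batch of states to scalar value estimates;
    all argument names here are illustrative, not from the source.
    """
    v_current = value_model(states)               # V_phi(s_t)
    with torch.no_grad():                         # target treated as fixed:
        v_next = value_model(next_states)         # no gradient through V_phi(s_{t+1})
    target = rewards + gamma * v_next             # r_t + gamma * V_phi(s_{t+1})
    return torch.mean((v_current - target) ** 2)  # MSE over the batch
```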
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Value Function Loss in RLHF
An AI system is being trained to generate helpful multi-turn dialogues. A state-value function, which estimates the total future reward from the current point in the conversation, is updated using rewards from a separate reward model. The development team observes that the value function consistently assigns very low values to all conversational turns except the very last one, even when the intermediate turns are crucial for a successful outcome. This causes the AI to prematurely end conversations. Which of the following is the most likely cause of this specific issue?
Impact of a Biased Reward Model on Value Function Training
Advantage Function as TD Error in RLHF
Arrange the following events in the correct chronological order to describe a single update step for a value function that relies on a separate reward model.
Learn After
Debugging a Value Model's Loss Calculation
A value model is trained using a loss function that minimizes the squared difference between its current value prediction, $V_\phi(s_t)$, and a target value calculated as the sum of the immediate reward and the discounted value of the next state, $r_t + \gamma V_\phi(s_{t+1})$. Why is the squared difference used as the core of this loss function, rather than simply the absolute difference or another metric?
A value model is being trained to estimate the expected future reward from a given state. Its loss is calculated as the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state. During the backpropagation step to update the model's parameters, gradients are computed with respect to both the model's prediction for the current state and its prediction for the next state (which is part of the target).
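A minimal sketch of the distinction this scenario hinges on, assuming a toy PyTorch setup (the linear model and tensors are illustrative only): the standard semi-gradient update detaches the target, whereas the setup described above lets gradients flow into the next-state prediction as well.

```python
import torch

# Toy linear value model over 4-dimensional states; purely illustrative.
v = torch.nn.Linear(4, 1)
s_t, s_next = torch.randn(1, 4), torch.randn(1, 4)
r, gamma = torch.tensor([[1.0]]), 0.99

# Semi-gradient TD (standard practice): the target is detached, so the
# update only moves V(s_t) toward r + gamma * V(s_{t+1}).
target = (r + gamma * v(s_next)).detach()
loss_semi = (v(s_t) - target).pow(2).mean()

# The scenario described above: no detach, so backpropagation also pushes
# V(s_{t+1}) toward the current prediction, which can destabilize training.
loss_full = (v(s_t) - (r + gamma * v(s_next))).pow(2).mean()
```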