Learn Before
A value model is trained using a loss function that minimizes the squared difference between its current value prediction, V(s_t), and a target value calculated as the sum of the immediate reward and the discounted value of the next state, r_t + γ·V(s_{t+1}). Why is the squared difference used as the core of this loss function, rather than simply the absolute difference or another metric?
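A minimal sketch of the contrast the question is probing (the numbers are toy values, not from the card): the squared difference has a gradient proportional to the error, so large mispredictions produce proportionally large updates and the gradient shrinks smoothly to zero as the prediction converges, whereas the absolute difference has a constant-magnitude gradient that is undefined exactly at zero error.

```python
def grad_squared(pred, target):
    # d/dpred (pred - target)^2 = 2 * (pred - target): proportional to the error.
    return 2 * (pred - target)

def grad_absolute(pred, target):
    # d/dpred |pred - target| = sign(pred - target): the same magnitude for
    # any nonzero error, and undefined exactly at pred == target.
    d = pred - target
    return (d > 0) - (d < 0)

# The squared-error gradient scales with the size of the error; the
# absolute-error gradient does not.
for error in (0.1, 1.0, 10.0):
    print(grad_squared(error, 0.0), grad_absolute(error, 0.0))
```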
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Debugging a Value Model's Loss Calculation
A value model is being trained to estimate the expected future reward from a given state. Its loss is calculated as the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state. During the backpropagation step to update the model's parameters, gradients are computed with respect to both the model's prediction for the current state and its prediction for the next state (which is part of the target).
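The scenario above can be worked through with hand-computed gradients (the numbers are hypothetical, chosen only for illustration). With loss L = (V(s_t) − (r + γ·V(s_{t+1})))², the standard TD setup treats the target as a constant ("stop-gradient"), so only the current-state prediction receives a gradient; letting backpropagation also flow through V(s_{t+1}) inside the target, as described in the scenario, adds a second gradient term that pulls the next-state value toward the current prediction.

```python
# Hypothetical values for one transition.
gamma = 0.9      # discount factor
r = 1.0          # immediate reward
v_curr = 0.5     # V(s_t), the model's prediction for the current state
v_next = 2.0     # V(s_{t+1}), the prediction used inside the target

target = r + gamma * v_next   # TD target: r + gamma * V(s_{t+1})
delta = v_curr - target       # TD error
loss = delta ** 2

# Correct gradients: the target is held constant during backpropagation,
# so only the current-state prediction is updated.
grad_v_curr_correct = 2 * delta
grad_v_next_correct = 0.0

# Buggy gradients: differentiating through the target as well produces an
# extra term on V(s_{t+1}) with the opposite sign, scaled by gamma.
grad_v_curr_buggy = 2 * delta
grad_v_next_buggy = -2 * gamma * delta
```

In autograd frameworks the correct behavior is usually obtained by detaching the next-state value from the graph before forming the target, so that the extra `grad_v_next_buggy` term never appears.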