Learn Before
A value model is trained using a loss function that minimizes the squared difference between its current value prediction, V(s_t), and a target value calculated as the sum of the immediate reward and the discounted value of the next state, r_t + γ·V(s_{t+1}). Why is the squared difference used as the core of this loss function, rather than simply the absolute difference or another metric?
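A minimal sketch of the contrast the question is probing (the numbers are toy values, not from the card): the squared difference has a gradient proportional to the error, so large mispredictions produce proportionally large updates and the gradient shrinks smoothly to zero as the prediction converges, whereas the absolute difference has a constant-magnitude gradient that is undefined exactly at zero error.

```python
def grad_squared(pred, target):
    # d/dpred (pred - target)^2 = 2 * (pred - target): proportional to the error.
    return 2 * (pred - target)

def grad_absolute(pred, target):
    # d/dpred |pred - target| = sign(pred - target): the same magnitude for
    # any nonzero error, and undefined exactly at pred == target.
    d = pred - target
    return (d > 0) - (d < 0)

# The squared-error gradient scales with the size of the error; the
# absolute-error gradient does not.
for error in (0.1, 1.0, 10.0):
    print(grad_squared(error, 0.0), grad_absolute(error, 0.0))
```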
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Debugging a Value Model's Loss Calculation
A value model is being trained to estimate the expected future reward from a given state. Its loss is calculated as the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state. During the backpropagation step to update the model's parameters, gradients are computed with respect to both the model's prediction for the current state and its prediction for the next state (which is part of the target).
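The scenario above can be worked through with hand-computed gradients (the numbers are hypothetical, chosen only for illustration). With loss L = (V(s_t) − (r + γ·V(s_{t+1})))², the standard TD setup treats the target as a constant ("stop-gradient"), so only the current-state prediction receives a gradient; letting backpropagation also flow through V(s_{t+1}) inside the target, as described in the scenario, adds a second gradient term that pulls the next-state value toward the current prediction.

```python
# Hypothetical values for one transition.
gamma = 0.9      # discount factor
r = 1.0          # immediate reward
v_curr = 0.5     # V(s_t), the model's prediction for the current state
v_next = 2.0     # V(s_{t+1}), the prediction used inside the target

target = r + gamma * v_next   # TD target: r + gamma * V(s_{t+1})
delta = v_curr - target       # TD error
loss = delta ** 2

# Correct gradients: the target is held constant during backpropagation,
# so only the current-state prediction is updated.
grad_v_curr_correct = 2 * delta
grad_v_next_correct = 0.0

# Buggy gradients: differentiating through the target as well produces an
# extra term on V(s_{t+1}) with the opposite sign, scaled by gamma.
grad_v_curr_buggy = 2 * delta
grad_v_next_buggy = -2 * gamma * delta
```

In autograd frameworks the correct behavior is usually obtained by detaching the next-state value from the graph before forming the target, so that the extra `grad_v_next_buggy` term never appears.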