Value Function Loss Minimization in RLHF
The value function, parameterized by $\phi$, is trained alongside the policy to estimate the expected future reward from a given state. Its parameters are updated by minimizing the Mean Squared Error (MSE) between the predicted state value, $V_\phi(s_t)$, and the computed return. The computed return is the sum of the immediate reward, $r_t$, and the discounted value of the next state, $\gamma V_\phi(s_{t+1})$. The loss function is averaged over a dataset $\mathcal{D}$ and all token positions $t$:

$$\mathcal{L}(\phi) = \mathbb{E}_{x \sim \mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}\Big(V_\phi(s_t) - \big(r_t + \gamma\, V_\phi(s_{t+1})\big)\Big)^2\right]$$
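As a concrete illustration, here is a minimal sketch of this loss in plain Python; the function name, argument names, and example numbers are illustrative assumptions rather than part of the original text.

```python
def value_loss(values, rewards, next_values, gamma=0.9):
    """MSE between the predicted value V(s_t) and the one-step target
    r_t + gamma * V(s_{t+1}), averaged over token positions.

    In a real implementation the target is treated as a constant, so no
    gradient flows through V(s_{t+1}).
    """
    targets = [r + gamma * v_next for r, v_next in zip(rewards, next_values)]
    squared_errors = [(v - tgt) ** 2 for v, tgt in zip(values, targets)]
    return sum(squared_errors) / len(squared_errors)

# Illustrative numbers only.
loss = value_loss(values=[1.2, 0.8], rewards=[0.5, 0.1], next_values=[1.0, 0.0])
print(round(loss, 3))  # 0.265
```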
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model is being trained to generate text. At a certain step, it considers generating the next token. The system has the following estimates:
- The value (expected future rewards) of the current state is 1.2.
- After generating a specific token, the immediate reward received is +0.5.
- The value of the new state after generating the token is 1.0.
- The discount factor for future rewards is 0.9.
Based on the standard temporal difference method for estimating the advantage, what is the advantage of taking this action, and what does it imply?
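For reference, a quick sketch of that computation, assuming the usual TD-error form of the advantage, A = r + gamma * V(s_next) - V(s_current):

```python
# TD-error estimate of the advantage: A = r + gamma * V(s_next) - V(s_current)
v_current, reward, v_next, gamma = 1.2, 0.5, 1.0, 0.9
advantage = reward + gamma * v_next - v_current
print(round(advantage, 3))  # 0.2: positive, so the action did better than the value estimate expected
```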
Policy Improvement Decision
Interpreting the Advantage Function
PPO Objective Formula for LLM Training in RLHF
During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?
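To make the contrast concrete, here is a minimal sketch of the policy side under the standard PPO clipped-surrogate form (the value side is the MSE loss shown above); the function name, epsilon value, and example numbers are illustrative assumptions:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO-style clipped surrogate for the policy (to be maximized).

    The policy is pushed to raise the probability of tokens with positive
    advantage, but the probability ratio is clipped to [1 - eps, 1 + eps] so a
    single update cannot move the policy too far from the one that generated
    the data. The value function, by contrast, is regressed toward its target
    by minimizing the squared error shown above.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Illustrative numbers: the updated policy slightly favors a token with positive advantage.
print(round(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.2, advantage=0.2), 3))  # 0.24
```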
In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.
Value Model Update Frequency in RLHF
Advantage Function as TD Error in RLHF
Diagnosing Training Stagnation in Joint Optimization
Analyzing a Single Training Step in Language Model Fine-Tuning
Calculating the Advantage for a Single Token Generation
During the fine-tuning of a large language model, at a specific generation step t, the calculated advantage value is found to be significantly negative (A_t < 0). What is the most accurate interpretation of this outcome?
Learn After
Diagnosing Value Function Training Issues
During a reinforcement learning update for a language model, the value function is trained to predict future rewards. At a specific step, the value function's output for the current state is V_current = 3.0. The model then generates a token, for which a reward model provides a score of r = 0.5. The value function's output for the new state is V_next = 4.0. Assuming a discount factor of γ = 0.9, the training objective is to minimize the squared difference between V_current and a target value. Based on these figures, what does the training objective imply about the initial prediction V_current?
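For reference, a quick check of these numbers, assuming the one-step TD target r + γ·V_next described above:

```python
# One-step TD target for the value update: target = r + gamma * V_next
v_current, reward, v_next, gamma = 3.0, 0.5, 4.0, 0.9
target = reward + gamma * v_next               # 0.5 + 3.6 = 4.1
squared_error = (v_current - target) ** 2      # (3.0 - 4.1)^2 ≈ 1.21
print(round(target, 2), round(squared_error, 2))
# The target (4.1) exceeds V_current (3.0), so minimizing the squared error
# pushes the estimate upward: the initial prediction was too low.
```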