Formula

Value Function Loss Minimization in RLHF

The value function, parameterized by $\omega$, is trained alongside the policy to estimate the expected future reward from a given state. Its parameters are updated by minimizing the Mean Squared Error (MSE) between the predicted state value, $V_\omega(\mathbf{x}, y_{<t})$, and the computed return, which is the sum of the immediate reward, $r_t$, and the discounted value of the next state, $\gamma V_\omega(\mathbf{x}, y_{<t+1})$. The loss is summed over every sequence $\mathbf{x}$ in a dataset $\mathcal{D}$ and over all token positions $t = 1, \dots, T$, then normalized by a constant $M$ (e.g., the number of training sequences):

$$\min_{\omega} \frac{1}{M} \sum_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} \left( r_t + \gamma V_\omega(\mathbf{x}, y_{<t+1}) - V_\omega(\mathbf{x}, y_{<t}) \right)^2$$
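As a concrete illustration, here is a minimal PyTorch sketch of this loss for a single sequence; averaging such per-sequence losses over $\mathcal{D}$ supplies the $1/M$ factor. The function name `value_loss`, the tensor layout, and the convention that the value of the state after the final token is zero are assumptions for the example, not part of the formula. One implementation detail the formula leaves open: the bootstrap target $r_t + \gamma V_\omega(\mathbf{x}, y_{<t+1})$ is commonly detached from the computation graph (a semi-gradient update), so gradients flow only through the prediction $V_\omega(\mathbf{x}, y_{<t})$.

```python
import torch

def value_loss(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    # values:  shape (T + 1,); values[t - 1] = V_omega(x, y_{<t}) for t = 1..T+1,
    #          where values[T] is the value after the last token (assumed 0 for
    #          a finished sequence).
    # rewards: shape (T,); per-token rewards r_t (in RLHF these are often zero
    #          everywhere except the final token, which carries the reward-model
    #          score, possibly plus a per-token KL penalty).
    targets = rewards + gamma * values[1:].detach()  # r_t + gamma * V(x, y_{<t+1})
    return ((targets - values[:-1]) ** 2).sum()      # squared TD error over t = 1..T

# Toy usage: a 5-token response rewarded only at its final token.
values = torch.randn(6, requires_grad=True)          # stands in for the value head's outputs
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])
loss = value_loss(values, rewards, gamma=0.99)
loss.backward()                                      # gradients w.r.t. omega
```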



