Value Function Loss in RLHF

The value model in RLHF, which estimates the expected future reward from a given state, is trained simultaneously with the policy model. Its training objective is to minimize the Mean Squared Error (MSE) between its predicted state value and a bootstrapped target formed from the immediate reward and the predicted value of the next state. This is equivalent to minimizing the squared Temporal Difference (TD) error. The loss function is:

$$
\mathcal{L}(\omega) = \frac{1}{M} \sum_{x \in D} \sum_{t=1}^{T} \left( r_t + \gamma\, V_\omega(x, y_{<t+1}) - V_\omega(x, y_{<t}) \right)^2
$$

where $V_\omega$ is the value function with parameters $\omega$, $M$ is the number of samples in the dataset $D$, $r_t$ is the reward received at step $t$, and $\gamma$ is the discount factor. The target $r_t + \gamma V_\omega(x, y_{<t+1})$ is treated as a fixed value during the gradient calculation for this loss, so the gradient flows only through the prediction $V_\omega(x, y_{<t})$.
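
To make the objective concrete, the following is a minimal PyTorch sketch of this loss, assuming per-prefix value predictions and per-step rewards are already available as tensors. The function name `td_value_loss` and the tensor shapes are illustrative assumptions, not part of the original formulation.

```python
import torch

def td_value_loss(values: torch.Tensor, rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Squared-TD-error loss for the value model.

    values:  (B, T+1) tensor; values[:, t] approximates V_w(x, y_{<t+1}),
             i.e. one prediction per prefix, including the final state.
    rewards: (B, T) tensor of per-step rewards r_t.
    """
    # Bootstrapped target r_t + gamma * V(next state). detach() holds the
    # target fixed so gradients flow only through the current prediction.
    targets = rewards + gamma * values[:, 1:].detach()
    td_errors = targets - values[:, :-1]
    # Sum the squared errors over time steps t = 1..T, then average over
    # the batch (the 1/M factor in the formula).
    return td_errors.pow(2).sum(dim=1).mean()

# Example usage with random tensors standing in for real model outputs:
B, T = 4, 16
values = torch.randn(B, T + 1, requires_grad=True)
rewards = torch.randn(B, T)
loss = td_value_loss(values, rewards)
loss.backward()
```

The `detach()` call implements the convention stated above: the target is treated as a constant during backpropagation, which prevents the bootstrapped part of the target from chasing its own updates.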
