Advantage Function as TD Error in RLHF
In RLHF, the advantage function $A(s_t, a_t)$, denoting the advantage of taking action $a_t$ given state $s_t$, is commonly estimated using the Temporal Difference (TD) error. This estimate is used in both the policy and value function updates. It is calculated by taking the immediate reward $r_t$, adding the discounted value estimate of the next state $\gamma V(s_{t+1})$, and subtracting the estimated value of the current state $V(s_t)$. The formula is: $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$. The state value function $V$ is typically trained concurrently, using rewards from the reward model.
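As a minimal sketch of this computation (not the card's original code), the snippet below estimates per-step advantages as one-step TD errors. The function name `td_error_advantage`, the discount factor value, and the toy reward/value numbers are illustrative assumptions, not taken from the source.

```python
import torch

def td_error_advantage(rewards, values, gamma=0.99):
    """Estimate per-step advantages as one-step TD errors:
    A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: tensor of shape (T,)   -- per-step rewards (e.g. from a reward model)
    values:  tensor of shape (T+1,) -- value estimates V(s_0) .. V(s_T),
             with V(s_T) = 0 for a terminated sequence.
    """
    # Element-wise TD error for every generation step.
    return rewards + gamma * values[1:] - values[:-1]

# Toy example with hypothetical numbers: reward arrives only at the final token.
rewards = torch.tensor([0.0, 0.0, 1.0])
values  = torch.tensor([0.5, 0.6, 0.8, 0.0])  # V(s_0..s_3), terminal value set to 0
print(td_error_advantage(rewards, values))
```

Under these assumed numbers, a positive entry means the step turned out better than the value function predicted, and a negative entry means it turned out worse.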
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Value Function Loss Minimization in RLHF
PPO Objective Formula for LLM Training in RLHF
During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?
In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.
Value Model Update Frequency in RLHF
Advantage Function as TD Error in RLHF
Diagnosing Training Stagnation in Joint Optimization
Value Function Loss in RLHF
An AI system is being trained to generate helpful multi-turn dialogues. A state-value function, which estimates the total future reward from the current point in the conversation, is updated using rewards from a separate reward model. The development team observes that the value function consistently assigns very low values to all conversational turns except the very last one, even when the intermediate turns are crucial for a successful outcome. This causes the AI to prematurely end conversations. Which of the following is the most likely cause of this specific issue?
Impact of a Biased Reward Model on Value Function Training
Advantage Function as TD Error in RLHF
Arrange the following events in the correct chronological order to describe a single update step for a value function that relies on a separate reward model.
Learn After
PPO Objective Formula for LLM Training in RLHF
Value Function Loss Minimization in RLHF
Analyzing a Single Training Step in Language Model Fine-Tuning
Calculating the Advantage for a Single Token Generation
During the fine-tuning of a large language model, at a specific generation step $t$, the calculated advantage value is found to be significantly negative ($A_t < 0$). What is the most accurate interpretation of this outcome?