
Advantage Function as TD Error in RLHF

In RLHF, the advantage function $A(s_t, a_t)$, which measures how much better it is to take action $a_t$ in state $s_t$ than the policy's average behavior, is commonly estimated with the one-step Temporal Difference (TD) error. This estimate is used in both the policy and value function updates. It is computed by taking the immediate reward $r_t$, adding the discounted value estimate of the next state $\gamma V(s_{t+1})$, and subtracting the value estimate of the current state $V(s_t)$:

$$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$

The state-value function $V(s_t)$ is typically trained concurrently with the policy, using rewards produced by the reward model.
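As a minimal sketch of this computation, the snippet below evaluates the one-step TD-error advantage for a single trajectory. The function name `td_advantage` and the example rewards and value estimates are illustrative assumptions, not part of any particular RLHF library.

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """One-step TD-error advantage: A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: per-step rewards r_t, shape (T,)
    values:  value estimates V(s_t), shape (T + 1,)
             (includes V(s_T) so the final step can bootstrap)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # Element-wise TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
    return rewards + gamma * values[1:] - values[:-1]

# Hypothetical 4-step trajectory where the reward model scores only the final step.
rewards = [0.0, 0.0, 0.0, 1.0]       # per-step rewards r_0..r_3
values = [0.2, 0.3, 0.5, 0.8, 0.0]   # V(s_0)..V(s_4); terminal value set to 0
print(td_advantage(rewards, values))  # per-step advantage estimates
```

Bootstrapping from $V(s_{t+1})$ rather than waiting for the full return keeps the estimate low-variance at the cost of bias from the learned value function.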

