Formula

Advantage Function Estimation in RLHF

In the context of policy optimization algorithms like PPO used in RLHF, the advantage function, denoted A_t, quantifies the relative value of taking a specific action at a given state. It is commonly estimated using the Temporal Difference (TD) error. The formula for this estimation is:

A_t = r_t + \gamma V_\omega(x, y_{<t+1}) - V_\omega(x, y_{<t})

Here, r_t is the reward provided by the reward model at step t, V_\omega is the value function with parameters \omega, \gamma is the discount factor, x is the input prompt, and y_{<t} denotes the tokens generated before step t.
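As a minimal sketch, the TD-error estimate above can be computed per token from a sequence of rewards and value estimates. The function name and the example numbers below are illustrative assumptions, not from any specific library; the terminal value after the last token is taken to be 0.

```python
def td_advantages(rewards, values, gamma=0.99):
    """Per-token TD-error advantage: A_t = r_t + gamma * V(x, y_{<t+1}) - V(x, y_{<t}).

    `values` has one more entry than `rewards`: values[t] is V_omega(x, y_{<t}),
    and the final entry is the value after the last token (0 for a finished
    sequence).
    """
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]

# Hypothetical 3-token completion: the reward model scores only the final token,
# while the value head produces an estimate for every prefix.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.9, 0.0]  # one extra entry; terminal value is 0
advantages = td_advantages(rewards, values)
```

In practice, PPO-based RLHF implementations often refine this one-step TD estimate (for example with generalized advantage estimation), but the per-token computation follows the same pattern.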


Updated 2025-10-08


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences