Advantage Function Estimation in RLHF
In the context of policy optimization algorithms like PPO used in RLHF, the advantage function, denoted $\hat{A}_t$, quantifies the relative value of taking a specific action at a given state. It is commonly estimated using the Temporal Difference (TD) error. The formula for this estimation is:

$$\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Here, $r_t$ is the reward provided by the reward model, $V$ is the value function, and $\gamma$ is the discount factor.
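For illustration, here is a minimal PyTorch sketch of this one-step TD-error estimate; the function name `td_advantage` and the example tensors are hypothetical, not taken from the course material:

```python
import torch

def td_advantage(rewards, values, next_values, gamma=0.99):
    """One-step TD-error advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * next_values - values

# Hypothetical per-token quantities along a sampled response:
rewards = torch.tensor([0.0, 0.0, 0.5])       # reward-model signal (often only at the final token)
values = torch.tensor([1.1, 1.2, 1.2])        # V(s_t) from the value head
next_values = torch.tensor([1.2, 1.2, 1.0])   # V(s_{t+1})
print(td_advantage(rewards, values, next_values, gamma=0.9))
```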
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
PPO Clipped Surrogate Objective in RLHF
Advantage Function Estimation in RLHF
PPO Objective Formula for LLM Training in RLHF
Diagnosing Training Instability in Language Model Fine-Tuning
A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
When fine-tuning a language model with a reward signal, an optimization method like Proximal Policy Optimization (PPO) is used. A correct implementation of this method would prioritize maximizing the reward score above all else, allowing for significant and unconstrained changes to the model's policy in each training step to quickly find high-reward outputs.
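For reference, PPO bounds how far a single update can move the policy by clipping the per-token probability ratio between the new and old policies. A minimal sketch of the clipped surrogate loss follows; the function name and tensor shapes are assumptions for illustration, not the course's implementation:

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective: keeps the policy ratio within
    [1 - clip_eps, 1 + clip_eps], so one update cannot move the policy
    arbitrarily far toward a high-reward but degenerate output."""
    ratio = torch.exp(logprobs_new - logprobs_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the surrogate objective -> minimize its negative mean.
    return -torch.min(unclipped, clipped).mean()
```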
Learn After
Value Function Loss Minimization in RLHF
A language model is being trained to generate text. At a certain step, it considers generating the next token. The system has the following estimates:
- The value (expected future rewards) of the current state is 1.2.
- After generating a specific token, the immediate reward received is +0.5.
- The value of the new state after generating the token is 1.0.
- The discount factor for future rewards is 0.9.
Based on the standard temporal difference method for estimating the advantage, what is the advantage of taking this action, and what does it imply?
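Worked out with the TD-error estimate above, for reference:

$$\hat{A} = r + \gamma V(s') - V(s) = 0.5 + 0.9 \times 1.0 - 1.2 = 0.2$$

The positive advantage implies the action turned out better than the current state's value estimate predicted, so the policy update should make this token more likely.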
Policy Improvement Decision
Interpreting the Advantage Function