PPO Clipped Surrogate Objective in RLHF
In the context of RLHF, the Proximal Policy Optimization (PPO) algorithm uses a clipped surrogate objective function to update the policy. This objective involves clipping the probability ratio of the current policy ($\pi_\theta$) to a reference policy ($\pi_{\text{ref}}$) and multiplying it by the advantage function ($\hat{A}_t$). This clipping mechanism helps to prevent large, destabilizing policy updates. The formula is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$
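A minimal sketch of this objective in PyTorch (the function name, tensor arguments, and the default clip range of 0.2 are illustrative assumptions, not a reference implementation):

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_ref, advantages, eps=0.2):
    """Clipped surrogate objective, negated so a standard optimizer can minimize it.

    logprobs_new: log pi_theta(a_t | s_t) under the current policy
    logprobs_ref: log pi_ref(a_t | s_t) under the reference policy
    advantages:   estimated advantage A_t for each token/action
    eps:          clip range epsilon (0.2 assumed here as a common default)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_ref, computed in log space.
    ratio = torch.exp(logprobs_new - logprobs_ref)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Pessimistic bound: element-wise minimum, averaged over tokens.
    return -torch.min(unclipped, clipped).mean()
```

Because of the element-wise minimum, the ratio only affects the gradient while it stays near the $[1-\epsilon,\ 1+\epsilon]$ band; once an update has already pushed the ratio past that band in the direction the advantage favors, the clipped term is selected, its gradient is zero, and that token contributes no further pressure to move away from the reference policy.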
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Advantage Function Estimation in RLHF
PPO Objective Formula for LLM Training in RLHF
Diagnosing Training Instability in Language Model Fine-Tuning
A team is fine-tuning a language model using a reinforcement learning process. In each step, the model generates a response to a prompt, a separate reward model scores the response, and the language model's parameters are updated based on this score. The team finds that a simple update rule, which aggressively maximizes the immediate reward, often leads to 'policy collapse'—the model's linguistic quality degrades, and it starts generating repetitive, nonsensical text that happens to exploit the reward model. What is the primary reason for employing an algorithm like Proximal Policy Optimization (PPO) in this scenario?
When fine-tuning a language model with a reward signal, an optimization method like Proximal Policy Optimization (PPO) is used. A correct implementation of this method balances maximizing the reward score against keeping each update small, constraining how far the model's policy can move from the reference policy in a single training step rather than making unconstrained changes to chase high-reward outputs.
Learn After
In a policy optimization step for a language model, the advantage for a particular generated token is calculated to be large and positive, indicating a highly desirable token. Simultaneously, the probability ratio (current policy's probability / reference policy's probability) for this token is significantly greater than 1 (e.g., 3.0). How does a clipping mechanism within the optimization objective function influence the resulting policy update for this token, and what is the primary reason for this influence? (Worked through numerically in the sketch after this list.)
During a policy update using a clipped surrogate objective, the advantage for a specific token is calculated to be negative (e.g., -2.5), indicating it's a poor choice. The probability ratio for this token is very low (e.g., 0.5), meaning the new policy is much less likely to produce this token than the reference policy. Given a clipping range of [0.8, 1.2], what is the primary effect of the clipping mechanism on the policy update for this token? (See the numeric sketch after this list.)
Evaluating the Clipping Range in Policy Optimization
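As a rough numeric check on the two scenarios above, the clipped term can be evaluated directly. A clip range of [0.8, 1.2] (i.e., epsilon = 0.2) is assumed for both scenarios, and the advantage of +2.0 in the first scenario is an illustrative value, since the prompt only says it is large and positive:

```python
import torch

# Scenario 1: advantage +2.0 (assumed), ratio 3.0, clip range [0.8, 1.2].
# min(3.0 * 2.0, clip(3.0) * 2.0) = min(6.0, 2.4) = 2.4: the saturated
# clipped term is selected, its gradient w.r.t. the ratio is zero, so the
# update toward this already much-more-likely token is capped.
ratio, adv = torch.tensor(3.0), torch.tensor(2.0)
print(torch.min(ratio * adv, torch.clamp(ratio, 0.8, 1.2) * adv))  # -> 2.4

# Scenario 2: advantage -2.5, ratio 0.5, clip range [0.8, 1.2].
# min(0.5 * -2.5, clip(0.5) * -2.5) = min(-1.25, -2.0) = -2.0: again the
# saturated clipped term wins, its gradient is zero, and the policy is not
# pushed to move even further below the reference probability in this step.
ratio, adv = torch.tensor(0.5), torch.tensor(-2.5)
print(torch.min(ratio * adv, torch.clamp(ratio, 0.8, 1.2) * adv))  # -> -2.0
```

In both cases, clipping keeps a single update from moving the policy far from the reference on the strength of one advantage estimate, which is the stabilizing behavior these questions probe.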