Formula

PPO Clipped Surrogate Objective in RLHF

In the context of RLHF, the Proximal Policy Optimization (PPO) algorithm uses a clipped surrogate objective function to update the policy. This objective clips the probability ratio between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\theta_{\text{ref}}}$ and multiplies it by the advantage function $A$. This clipping mechanism helps to prevent large, destabilizing policy updates. The formula is

$$U_{\text{ppo-clip}}(x, y; \theta) = \sum_{t=1}^{T} \text{Clip}\left( \frac{\pi_{\theta}(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{ref}}}(y_t \mid x, y_{<t})} \right) A(x, y_{<t}, y_t)$$

where $x$ is the prompt, $y$ is the sampled response of length $T$, $A(x, y_{<t}, y_t)$ is the advantage of generating token $y_t$ in context, and $\text{Clip}(\cdot)$ restricts the probability ratio to the interval $[1-\epsilon, 1+\epsilon]$ for a small clipping threshold $\epsilon$.
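To make the formula concrete, here is a minimal PyTorch sketch of the objective as written above, for a single sampled response. The function name, the per-token log-probability inputs, and the default threshold eps=0.2 are illustrative assumptions, not from the source; note also that the original PPO objective of Schulman et al. additionally takes an elementwise minimum between the unclipped and clipped terms, which this sketch omits to match the formula as stated.

```python
import torch

def ppo_clip_objective(logp_theta: torch.Tensor,
                       logp_ref: torch.Tensor,
                       advantages: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Sketch of U_{ppo-clip}(x, y; theta) for one prompt-response pair.

    logp_theta : log pi_theta(y_t | x, y_<t) per token, shape (T,)
    logp_ref   : log pi_theta_ref(y_t | x, y_<t) per token, shape (T,)
    advantages : A(x, y_<t, y_t) per token, shape (T,)
    eps        : clipping threshold (0.2 is an assumed, common default)
    """
    # Probability ratio pi_theta / pi_theta_ref, computed stably in log space.
    ratio = torch.exp(logp_theta - logp_ref)
    # Clip(.) bounds the ratio to [1 - eps, 1 + eps], limiting the update size.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Sum the clipped ratio times the advantage over the T generated tokens.
    return (clipped * advantages).sum()
```

In training, this quantity would be maximized over a batch of sampled prompt-response pairs (e.g., by gradient ascent on $\theta$), with the reference log-probabilities held fixed as constants.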

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences