Formula

PPO Clipped Objective for Language Models

In the context of training language models with PPO, the clipped surrogate objective, denoted $U_{\text{ppo-clip}}$, is computed by summing over the generated tokens. For each token $y_t$ in the response $\mathbf{y}$, the objective considers the ratio of probabilities between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\theta_{\text{ref}}}$. This ratio is clipped to prevent large policy updates and then multiplied by the advantage function $A$. The formula is:

$$U_{\text{ppo-clip}}(\mathbf{x}, \mathbf{y}; \theta) = \sum_{t=1}^{T} \text{Clip}\left(\frac{\pi_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t})}{\pi_{\theta_{\text{ref}}}(y_t \mid \mathbf{x}, \mathbf{y}_{<t})}\right) A(\mathbf{x}, \mathbf{y}_{<t}, y_t)$$
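The objective above can be sketched numerically. The following is a minimal illustration, not the book's implementation: it assumes per-token log-probabilities are already available for both policies, and it uses a clipping range $[1-\epsilon, 1+\epsilon]$ with a hypothetical default of $\epsilon = 0.2$, a common choice in PPO. Note that, as written above, the ratio is clipped directly before multiplying by the advantage (the original PPO objective also takes a minimum with the unclipped term).

```python
import numpy as np

def ppo_clip_objective(logp_current, logp_ref, advantages, epsilon=0.2):
    """Sum of clipped ratio * advantage over the tokens of one response.

    logp_current: log pi_theta(y_t | x, y_<t) for each generated token
    logp_ref:     log pi_theta_ref(y_t | x, y_<t) for the same tokens
    advantages:   advantage estimate A(x, y_<t, y_t) per token
    epsilon:      clipping range half-width (hypothetical default)
    """
    # Probability ratio, computed stably from log-probabilities.
    ratio = np.exp(np.asarray(logp_current) - np.asarray(logp_ref))
    # Clip the ratio to [1 - epsilon, 1 + epsilon] to limit the update.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Weight each clipped ratio by its advantage and sum over tokens.
    return float(np.sum(clipped * np.asarray(advantages)))
```

For example, a token whose probability doubled under the current policy (ratio 2.0) contributes only $1.2 \cdot A$ with $\epsilon = 0.2$, since the ratio is clipped at $1 + \epsilon$.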


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences