Formula

Clipped Utility Function with Upper-Bound Clipping

This clipped utility function is a variation of the policy gradient objective that applies an upper-bound clip to the importance sampling ratio in order to stabilize training. For a trajectory τ, the utility is computed by summing, over all time steps t, the product of the advantage function A(s_t, a_t) and the clipped ratio between the current policy π_θ and the reference policy π_θ_ref:

U_{\text{clip}}(\tau; \theta) = \sum_{t=1}^{T} \text{Clip}\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{ref}}}(a_t \mid s_t)} \right) A(s_t, a_t)

The Clip function here applies only an upper bound to the ratio, \text{Clip}(r) = \min(r, 1 + \epsilon). This caps how much the policy can be updated for actions with positive advantage, but it does not apply a corresponding lower bound for actions with negative advantage, which distinguishes it from the standard PPO clipped surrogate objective.
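As a concrete illustration, here is a minimal NumPy sketch of this utility, assuming per-step log-probabilities under the current and reference policies are available. The function name, array shapes, and the ε = 0.2 default are illustrative assumptions, not part of the source formula.

```python
import numpy as np

def clipped_utility(logp_theta, logp_ref, advantages, eps=0.2):
    """Upper-bound-clipped utility U_clip(tau; theta) for one trajectory.

    logp_theta : log pi_theta(a_t | s_t) for each step t, shape (T,)
    logp_ref   : log pi_theta_ref(a_t | s_t) for each step t, shape (T,)
    advantages : A(s_t, a_t) for each step t, shape (T,)
    eps        : clipping parameter epsilon (0.2 is an assumed default)
    """
    ratio = np.exp(logp_theta - logp_ref)      # importance sampling ratio r_t
    clipped = np.minimum(ratio, 1.0 + eps)     # upper bound only: Clip(r) = min(r, 1 + eps)
    return float(np.sum(clipped * advantages)) # sum over time steps t = 1..T

# Example with illustrative values for a three-step trajectory:
logp_theta = np.log(np.array([0.5, 0.3, 0.9]))
logp_ref   = np.log(np.array([0.4, 0.3, 0.5]))
adv        = np.array([1.0, -0.5, 2.0])
# ratios 1.25, 1.0, 1.8 are capped at 1.2, giving utility 1.2 - 0.5 + 2.4 = 3.1
print(clipped_utility(logp_theta, logp_ref, adv))
```

Note that ratios below 1 pass through unchanged, so only updates that would increase an action's probability too far beyond the reference policy are restrained.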


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
