Clipped Utility Function with Upper-Bound Clipping
This clipped utility function is a variation of the policy gradient objective that applies an upper-bound clip to the importance sampling ratio in order to stabilize training. For a trajectory τ, the utility is computed by summing, over all time steps t, the product of the advantage function A(s_t, a_t) and the clipped policy probability ratio. The formula is:

U(τ; θ) = Σ_t Clip(π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)) · A(s_t, a_t), where Clip(r) = min(r, 1 + ε)

The Clip function used here applies only an upper bound to the ratio, capping it at 1 + ε. This limits how much the policy can be updated toward actions with positive advantage, but it imposes no corresponding lower bound for actions with negative advantage, which distinguishes it from the standard PPO clipped surrogate objective.
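For concreteness, here is a minimal NumPy sketch of this objective; the function name and the sample ratio/advantage values are illustrative, not taken from a specific library or from the text above.

```python
import numpy as np

def clipped_utility(ratios, advantages, eps=0.2):
    """U(τ; θ) = Σ_t min(r_t, 1 + ε) · A(s_t, a_t), upper-bound clip only."""
    clipped = np.minimum(ratios, 1.0 + eps)  # cap at 1 + ε; no lower bound is applied
    return np.sum(clipped * advantages)

# r_t = π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) and A(s_t, a_t) per time step (illustrative values)
ratios = np.array([0.5, 1.0, 3.0])
advantages = np.array([-1.0, 0.5, 2.0])
print(clipped_utility(ratios, advantages))  # 0.5*(-1.0) + 1.0*0.5 + 1.2*2.0 = 2.4
```

Note that only the ratio 3.0 is modified (capped at 1.2); the ratios at or below 1 + ε pass through unchanged, including the one attached to the negative advantage.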

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Clipped Utility Function with Upper-Bound Clipping
An agent's policy is being updated using an objective function that relies on importance sampling. Consider a single time step t in a trajectory where the calculated advantage A(s_t, a_t) is large and positive. At the same time, the importance sampling ratio π_θ(a_t|s_t) / π_θ_ref(a_t|s_t) is also large (e.g., 5.0), indicating the current policy is much more likely to choose action a_t than the reference policy was. Given the objective function U(τ; θ) = Σ_t [π_θ(a_t|s_t) / π_θ_ref(a_t|s_t)] · A(s_t, a_t), what is the most direct consequence of this situation for this specific time step's contribution to the policy update?
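For intuition, the step's contribution under this unclipped objective can be computed directly. In the sketch below, only the ratio 5.0 comes from the question; the advantage value is an illustrative assumption.

```python
# Unclipped objective: each time step contributes r_t * A(s_t, a_t).
ratio = 5.0      # π_θ(a_t|s_t) / π_θ_ref(a_t|s_t), as given in the question
advantage = 2.0  # illustrative large positive advantage (not specified above)
contribution = ratio * advantage
print(contribution)  # 10.0: the gradient for this step is amplified by the full 5x ratio
```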
Calculating Trajectory Utility with Importance Sampling
In the context of updating a policy using an objective function with importance sampling, if the ratio of the current policy's probability to the reference policy's probability for a given action is greater than 1, this will always increase the likelihood of that action being selected in the subsequent policy update.
Clipped Utility Function with Upper-Bound Clipping
Consider a reinforcement learning agent being trained with a policy gradient method. For a given state-action pair, the ratio of the new policy's probability to the old policy's probability is 3.0. The estimated advantage for this action is positive. The algorithm incorporates a clipping mechanism defined as min(ratio, 1 + ε), where ε is set to 0.2. What is the primary effect of this mechanism on the policy update for this specific step?
Asymmetric Effect of Upper-Bound Clipping
A policy update mechanism uses a function to adjust the policy probability ratio, defined as min(ratio, 1 + ε). Given ε = 0.2, match each original ratio value on the left with its corresponding adjusted value on the right after the function is applied.
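A short sketch of what min(ratio, 1 + ε) does across a range of ratios; the exercise's own left-column values are not shown here, so the inputs below are illustrative.

```python
eps = 0.2
for ratio in [0.5, 1.0, 1.5, 3.0]:  # illustrative inputs
    print(ratio, "->", min(ratio, 1 + eps))
# 0.5 -> 0.5, 1.0 -> 1.0, 1.5 -> 1.2, 3.0 -> 1.2
# Only ratios above 1 + ε are altered; everything at or below 1.2 passes through unchanged.
```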
Learn After
Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
PPO Clipped Objective for Language Models
A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1 + ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:
- Action A: has a large positive advantage, and its probability ratio is 2.0.
- Action B: has a large negative advantage, and its probability ratio is 0.1.
Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
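The two cases can be checked directly; a minimal sketch using the values from the question:

```python
eps = 0.2

def adjusted(ratio):
    # Upper-bound clip only: caps the ratio at 1 + ε, never raises it.
    return min(ratio, 1 + eps)

print(adjusted(2.0))  # Action A: 2.0 -> 1.2, the positive-advantage update is capped
print(adjusted(0.1))  # Action B: 0.1 -> 0.1, unchanged, since there is no lower bound
```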
Stabilizing Policy Gradient Training
A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1 + ε) is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).