Formula

Clipped Surrogate Objective Function

To address the high variance and resultant instability in policy gradient estimates, a clipped surrogate objective function is widely used. This objective incorporates a clipping mechanism to bound the importance weights, ensuring that individual policy updates do not become excessively large. The clipped utility function is formally defined as:

$$U_{\mathrm{clip}}(\tau;\theta) = \sum_{t=1}^{T} \mathrm{Clip}\Big( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)} \Big) A(s_t,a_t)$$

where the clipping operation restricts the probability ratio using a specified boundary hyperparameter $\epsilon$:

$$\mathrm{Clip}\Big( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)} \Big) = \min\Big( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)}, \mathrm{bound}\Big( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t|s_t)}, 1 - \epsilon, 1 + \epsilon \Big) \Big).$$
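The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`bound`, `clip_ratio`, `clipped_utility`), the use of log-probabilities as inputs, and the default `epsilon=0.2` are all assumptions made for the example and do not come from the source.

```python
import math


def bound(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clip_ratio(ratio, epsilon):
    """Clip(r) = min(r, bound(r, 1 - eps, 1 + eps)), following the formula above."""
    return min(ratio, bound(ratio, 1.0 - epsilon, 1.0 + epsilon))


def clipped_utility(logp_new, logp_ref, advantages, epsilon=0.2):
    """Sum over time steps of Clip(pi_theta / pi_theta_ref) * A(s_t, a_t).

    logp_new, logp_ref: per-step log-probabilities of the taken actions under
    the current and reference policies (hypothetical inputs for this sketch).
    advantages: per-step advantage estimates A(s_t, a_t).
    epsilon: clipping hyperparameter; 0.2 is an assumed default.
    """
    total = 0.0
    for lp_new, lp_ref, adv in zip(logp_new, logp_ref, advantages):
        # Probability ratio computed from log-probabilities for numerical stability.
        ratio = math.exp(lp_new - lp_ref)
        total += clip_ratio(ratio, epsilon) * adv
    return total


# Two-step trajectory: the ratios are 0.6/0.4 = 1.5 and 0.1/0.2 = 0.5.
u = clipped_utility(
    logp_new=[math.log(0.6), math.log(0.1)],
    logp_ref=[math.log(0.4), math.log(0.2)],
    advantages=[1.0, 2.0],
)
```

With $\epsilon = 0.2$, the first ratio (1.5) is cut down to 1.2. The second ratio (0.5) passes through unchanged, because the outer $\min$ in the formula as written always returns the raw ratio when it falls below $1 - \epsilon$.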

Updated 2026-05-01

Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences