Formula

Upper-Bound Clipping Function for Policy Ratios

This clipping function is used in some variants of policy gradient algorithms to keep the policy probability ratio, $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{ref}}}(a_t|s_t)}$, from becoming too large. It is defined as the minimum of the original ratio and the ratio bounded within $[1-\epsilon, 1+\epsilon]$:

$$\text{Clip}\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{ref}}}(a_t|s_t)}\right) = \min\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{ref}}}(a_t|s_t)}, \text{bound}\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{ref}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right)\right)$$

This operation is mathematically equivalent to $\min\left(\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{ref}}}(a_t|s_t)}, 1+\epsilon\right)$. Whenever the ratio falls below $1-\epsilon$, the bounded value $1-\epsilon$ is larger than the ratio itself, so the outer minimum returns the unclipped ratio and the lower bound never takes effect. Only the upper bound applies. It is used to prevent the policy from making excessively large updates when an action has a positive advantage.
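The equivalence between the two-sided definition and the simpler upper bound can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names `bound`, `clip_ratio`, and `clip_ratio_simplified` are illustrative, and the value ε = 0.2 is an assumed example rather than one given in the source.

```python
def bound(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def clip_ratio(ratio, epsilon):
    """Clip as defined above: min(ratio, bound(ratio, 1 - eps, 1 + eps))."""
    return min(ratio, bound(ratio, 1.0 - epsilon, 1.0 + epsilon))


def clip_ratio_simplified(ratio, epsilon):
    """The equivalent form: only an upper bound of 1 + eps."""
    return min(ratio, 1.0 + epsilon)


epsilon = 0.2  # assumed example value
for ratio in (0.5, 0.8, 1.0, 1.2, 1.5):
    # Both forms agree for ratios below, inside, and above the interval.
    assert clip_ratio(ratio, epsilon) == clip_ratio_simplified(ratio, epsilon)

# A ratio below 1 - eps passes through unchanged; one above 1 + eps is capped.
assert clip_ratio(0.5, epsilon) == 0.5
assert clip_ratio(1.5, epsilon) == 1.0 + epsilon
```

In a training loop the same operation would typically be applied elementwise to a batch of ratios, but the scalar version is enough to show that the lower bound is never active.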

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
