Upper-Bound Clipping Function for Policy Ratios
This clipping function is used in some variants of policy gradient algorithms to constrain the policy probability ratio, ratio = π_new(a|s) / π_old(a|s), from becoming too large. It is defined as the minimum of the original ratio and the ratio bounded within [1 − ε, 1 + ε]: min(ratio, bound(ratio, 1 − ε, 1 + ε)). This operation is mathematically equivalent to taking min(ratio, 1 + ε), which effectively applies only an upper bound to the ratio. It is used to prevent the policy from making excessively large updates when an action has a positive advantage.
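A minimal sketch of this operation in Python (the function name and example values are illustrative, not from the source):

```python
def upper_bound_clip(ratio: float, epsilon: float) -> float:
    """Clip the policy probability ratio from above only.

    Equivalent to min(ratio, bound(ratio, 1 - epsilon, 1 + epsilon)),
    since the outer min discards any lower-bound effect.
    """
    return min(ratio, 1.0 + epsilon)

# Large ratios are capped, so a positive advantage cannot drive
# an arbitrarily large policy update; small ratios pass through.
print(upper_bound_clip(3.0, 0.2))  # 1.2
print(upper_bound_clip(0.5, 0.2))  # 0.5
```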

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Upper-Bound Clipping Function for Policy Ratios
A policy optimization algorithm uses a bounding function, bound(value, lower_bound, upper_bound), to constrain a ratio of action probabilities. This function clips the value to ensure it stays within the interval [lower_bound, upper_bound]. If the ratio value is 1.5, and the interval is defined by a parameter ε = 0.2 (i.e., the interval is [1 − 0.2, 1 + 0.2]), what is the resulting value after the bounding operation is applied?
In a policy optimization algorithm, a ratio comparing the likelihood of an action under a new policy versus an old policy is constrained to stay within the interval [1 − ε, 1 + ε]. What is the most likely consequence of setting the parameter ε to a very small value (e.g., 0.01)?
Applying a Bounding Constraint on Probability Ratios
Increased Action Probability Condition
Policy Probability Ratio Less Than One
Bound Function for Policy Probability Ratio
Policy Probability Ratio Greater Than One
Upper-Bound Clipping Function for Policy Ratios
Evaluating a Policy Change
In an off-policy reinforcement learning scenario, an agent is in a specific state. The policy that originally collected the training data (the reference policy) selected a particular action with a probability of 0.2. The agent's current, updated policy would select that same action with a probability of 0.8. What does the resulting probability ratio imply about how the reward for this action-state pair should be treated during the policy update?
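A quick numeric check of the ratio in this scenario (variable names are illustrative):

```python
# Probability the reference (data-collecting) policy assigned to the action
p_old = 0.2
# Probability the current, updated policy assigns to the same action
p_new = 0.8

# Importance ratio: how much more likely the action is under the new policy
ratio = p_new / p_old
print(ratio)  # 4.0
```

A ratio of 4.0 means the new policy favors this action far more than the reference policy did, so in standard importance-weighted updates the reward for this state-action pair would be scaled up accordingly.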
Interpreting Policy Changes
Learn After
Clipped Utility Function with Upper-Bound Clipping
Consider a reinforcement learning agent being trained with a policy gradient method. For a given state-action pair, the ratio of the new policy's probability to the old policy's probability is 3.0. The estimated advantage for this action is positive. The algorithm incorporates a clipping mechanism defined as min(ratio, 1 + ε), where ε is set to 0.2. What is the primary effect of this mechanism on the policy update for this specific step?
Asymmetric Effect of Upper-Bound Clipping
A policy update mechanism uses a function to adjust the policy probability ratio, defined as min(ratio, 1 + ε). Given ε = 0.2, match each original ratio value on the left with its corresponding adjusted value on the right after the function is applied.
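One way to check such a matching is to evaluate min(ratio, 1 + ε) directly. A sketch (the sample ratio values are assumptions, not taken from the exercise):

```python
epsilon = 0.2
for ratio in [0.5, 1.0, 1.2, 1.5, 3.0]:
    adjusted = min(ratio, 1.0 + epsilon)
    print(f"{ratio} -> {adjusted}")
# Ratios at or below 1 + epsilon pass through unchanged;
# larger ratios are capped at 1.2 — the asymmetry of upper-bound clipping.
```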