PPO Clipped Objective for Language Models
In the context of training language models with PPO, the clipped surrogate objective, denoted as $L^{\text{clip}}(\theta)$, is computed by summing over the generated tokens. For each token $y_t$ in the response $y$, the objective considers the ratio $r_t(\theta)$ of probabilities between the current policy $\pi_\theta$ and a reference policy $\pi_{\text{ref}}$. This ratio is clipped to prevent large policy updates, and the objective takes the minimum of the unclipped and clipped terms, each multiplied by the advantage function $\hat{A}_t$. The formula is:

$$L^{\text{clip}}(\theta) = \sum_{t=1}^{|y|} \min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big), \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$
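A minimal PyTorch sketch of this computation, assuming per-token log-probabilities from the current and reference policies and precomputed advantages are available (the function and tensor names here are illustrative, not from the source):

```python
import torch

def ppo_clipped_objective(logp_current, logp_ref, advantages, eps=0.2):
    """Clipped surrogate objective, summed over the response tokens.

    logp_current: log pi_theta(y_t | x, y_<t) per token, shape (T,)
    logp_ref:     log pi_ref(y_t | x, y_<t) per token, shape (T,)
    advantages:   advantage estimate A_hat_t per token, shape (T,)
    """
    # Probability ratio r_t, formed in log space for numerical stability.
    ratio = torch.exp(logp_current - logp_ref)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic elementwise minimum, summed over the generated tokens.
    return torch.minimum(unclipped, clipped).sum()

# Toy usage with three response tokens (values are made up).
logp_cur = torch.tensor([-1.0, -0.5, -2.0])
logp_ref = torch.tensor([-1.2, -0.4, -2.5])
adv = torch.tensor([0.8, -0.3, 1.1])
loss = -ppo_clipped_objective(logp_cur, logp_ref, adv)  # negate: optimizers minimize
```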
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1+ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:
- Action A: Has a large positive advantage, and its probability ratio is 2.0.
- Action B: Has a large negative advantage, and its probability ratio is 0.1.
Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1+ε) is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).
Stabilizing Policy Gradient Training
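Plugging the numbers from the two-action question above into the single-sided clip (with ε = 0.2, the threshold is 1 + ε = 1.2):

$$\min(r_A,\,1+\epsilon) = \min(2.0,\,1.2) = 1.2, \qquad \min(r_B,\,1+\epsilon) = \min(0.1,\,1.2) = 0.1$$

Action A's ratio is clipped: the term $1.2\,\hat{A}_A$ is locally constant in the policy parameters, so this action contributes no gradient despite its large positive advantage. Action B's ratio is untouched by the upper bound, so the full term $0.1\,\hat{A}_B$ (with $\hat{A}_B < 0$) keeps its gradient and continues to push Action B's probability down.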
Learn After
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
During the training of a language model, the policy is updated based on a clipped objective function. Consider a single token generation step where the ratio of the current policy's probability to the reference policy's probability for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
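Working this question through with the clipped objective from the top of this note (ratio $r = 3.0$, range $[0.8, 1.2]$, $\hat{A} > 0$):

$$\min\big(3.0\,\hat{A},\ \operatorname{clip}(3.0,\,0.8,\,1.2)\,\hat{A}\big) = \min\big(3.0\,\hat{A},\ 1.2\,\hat{A}\big) = 1.2\,\hat{A}$$

The token's contribution is capped at $1.2\,\hat{A}$, and because the clipped term is locally constant in the policy parameters, this token adds no gradient to the update, however large its advantage.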
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign