PPO Clipped Objective for Language Models
In the context of training language models with PPO, the clipped surrogate objective, denoted as $L^{\text{clip}}(\theta)$, is computed by summing over the generated tokens. For each token $y_t$ in the response $y$, the objective considers the ratio $r_t(\theta)$ of probabilities between the current policy $\pi_\theta$ and a reference policy $\pi_{\text{ref}}$. This ratio is clipped to prevent large policy updates, and the objective takes the minimum of the unclipped and clipped terms, each multiplied by the advantage function $\hat{A}_t$. The formula is:

$$L^{\text{clip}}(\theta) = \sum_{t=1}^{|y|} \min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big), \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$
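A minimal PyTorch sketch of this computation, assuming per-token log-probabilities from the current and reference policies and precomputed advantages are available (the function and tensor names here are illustrative, not from the source):

```python
import torch

def ppo_clipped_objective(logp_current, logp_ref, advantages, eps=0.2):
    """Clipped surrogate objective, summed over the response tokens.

    logp_current: log pi_theta(y_t | x, y_<t) per token, shape (T,)
    logp_ref:     log pi_ref(y_t | x, y_<t) per token, shape (T,)
    advantages:   advantage estimate A_hat_t per token, shape (T,)
    """
    # Probability ratio r_t, formed in log space for numerical stability.
    ratio = torch.exp(logp_current - logp_ref)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic elementwise minimum, summed over the generated tokens.
    return torch.minimum(unclipped, clipped).sum()

# Toy usage with three response tokens (values are made up).
logp_cur = torch.tensor([-1.0, -0.5, -2.0])
logp_ref = torch.tensor([-1.2, -0.4, -2.5])
adv = torch.tensor([0.8, -0.3, 1.1])
loss = -ppo_clipped_objective(logp_cur, logp_ref, adv)  # negate: optimizers minimize
```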
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective
A reinforcement learning agent is being trained using a utility function that incorporates an upper-bound clip on the policy probability ratio, defined as min(ratio, 1+ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:
- Action A: Has a large positive advantage, and its probability ratio is 2.0.
- Action B: Has a large negative advantage, and its probability ratio is 0.1.
Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
A utility function that modifies the policy probability ratio r_t using the operation min(r_t, 1+ε) is primarily intended to mitigate training instability caused by actions that are discovered to be substantially worse than the reference policy's actions (i.e., actions with a large negative advantage).
Stabilizing Policy Gradient Training
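Plugging the numbers from the two-action question above into the single-sided clip (with ε = 0.2, the threshold is 1 + ε = 1.2):

$$\min(r_A,\,1+\epsilon) = \min(2.0,\,1.2) = 1.2, \qquad \min(r_B,\,1+\epsilon) = \min(0.1,\,1.2) = 0.1$$

Action A's ratio is clipped: the term $1.2\,\hat{A}_A$ is locally constant in the policy parameters, so this action contributes no gradient despite its large positive advantage. Action B's ratio is untouched by the upper bound, so the full term $0.1\,\hat{A}_B$ (with $\hat{A}_B < 0$) keeps its gradient and continues to push Action B's probability down.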
Learn After
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
During the training of a language model, the policy is updated based on a clipped objective function. Consider a single token generation step where the ratio of the current policy's probability to the reference policy's probability for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
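Working this question through with the clipped objective from the top of this note (ratio $r = 3.0$, range $[0.8, 1.2]$, $\hat{A} > 0$):

$$\min\big(3.0\,\hat{A},\ \operatorname{clip}(3.0,\,0.8,\,1.2)\,\hat{A}\big) = \min\big(3.0\,\hat{A},\ 1.2\,\hat{A}\big) = 1.2\,\hat{A}$$

The token's contribution is capped at $1.2\,\hat{A}$, and because the clipped term is locally constant in the policy parameters, this token adds no gradient to the update, however large its advantage.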
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign