Learn Before
During the training of a language model with PPO, the policy is updated based on a clipped objective function. Consider a single token-generation step where the ratio of the current policy's probability to the old policy's probability (the policy before this update) for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
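A minimal worked sketch of the arithmetic, assuming the standard PPO clipped surrogate L = min(r·A, clip(r, 0.8, 1.2)·A); the function name and the sample advantage value (2.0) are illustrative, not from the source:

```python
def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_low: float = 0.8, clip_high: float = 1.2) -> float:
    """PPO clipped surrogate for a single token:
    min(ratio * A, clip(ratio, clip_low, clip_high) * A)."""
    clipped_ratio = max(clip_low, min(ratio, clip_high))
    return min(ratio * advantage, clipped_ratio * advantage)

# Scenario from the question: ratio = 3.0, advantage strongly positive.
# clip(3.0, 0.8, 1.2) = 1.2, so the objective is capped at 1.2 * A
# rather than 3.0 * A; since the clipped term is constant in the ratio,
# it contributes no gradient, preventing an oversized policy update.
print(ppo_clipped_objective(3.0, 2.0))  # 2.4, not 6.0
```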
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign