Learn Before
During a policy update using a clipped surrogate objective, the advantage for a specific token is calculated to be negative (e.g., -2.5), indicating it's a poor choice. The probability ratio for this token is very low (e.g., 0.5), meaning the new policy is much less likely to produce this token than the reference policy. Given a clipping range of [0.8, 1.2], what is the primary effect of the clipping mechanism on the policy update for this token?
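To make the question concrete, here is a minimal numeric sketch of the standard PPO-style clipped surrogate term evaluated with the values given above (A = -2.5, r = 0.5, clip range [0.8, 1.2]). The function name and structure are illustrative, not taken from the card:

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_low: float = 0.8, clip_high: float = 1.2) -> float:
    """Per-token clipped surrogate: min(r * A, clip(r, lo, hi) * A)."""
    unclipped = ratio * advantage
    clipped = min(max(ratio, clip_low), clip_high) * advantage
    return min(unclipped, clipped)

ratio, advantage = 0.5, -2.5
unclipped = ratio * advantage                    # 0.5 * -2.5 = -1.25
clipped = max(ratio, 0.8) * advantage            # 0.8 * -2.5 = -2.0
objective = clipped_surrogate(ratio, advantage)  # min(-1.25, -2.0) = -2.0
print(unclipped, clipped, objective)
```

The `min` selects the clipped branch (-2.0). Because the clipped ratio is a constant (0.8) once r falls below the lower bound, that branch contributes zero gradient with respect to the policy parameters, so the update cannot push the token's probability down any further.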
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a policy optimization step for a language model, the advantage for a particular generated token is calculated to be large and positive, indicating a highly desirable token. Simultaneously, the probability ratio (current policy's probability / reference policy's probability) for this token is significantly greater than 1 (e.g., 3.0). How does a clipping mechanism within the optimization objective function influence the resulting policy update for this token, and what is the primary reason for this influence?
Evaluating the Clipping Range in Policy Optimization