Learn Before
Policy Update Analysis with Negative Advantage
Based on the provided data, analyze how the clipping mechanism alters the objective value for this token compared to what it would be without clipping. Explain the reasoning behind this alteration and its purpose in the training process.
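The effect of clipping on a single token can be checked numerically. The sketch below assumes the standard per-token clipped surrogate, min(r * A, clip(r, 1 - eps, 1 + eps) * A), with a clipping range of [0.8, 1.2]. The ratio and advantage values are hypothetical stand-ins, since the data the question refers to is not reproduced here, and the function name `clipped_objective` is illustrative only.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Restrict the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum keeps the more pessimistic of the two terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical values: negative advantage, ratio already below 1 - eps.
ratio, advantage = 0.5, -2.0

unclipped = ratio * advantage                  # about -1.0
clipped = clipped_objective(ratio, advantage)  # about -1.6 (0.8 * -2.0)

print(f"unclipped objective: {unclipped:.2f}")
print(f"clipped objective:   {clipped:.2f}")

# Hypothetical opposite case: negative advantage, ratio above 1 + eps.
# The unclipped term is the smaller one here, so clipping does not apply.
print(f"ratio 1.5:           {clipped_objective(1.5, advantage):.2f}")
```

Under these assumed numbers the clipped term is selected when the ratio falls below 0.8. That term does not depend on the current policy, so the token contributes no gradient and its probability is not pushed down any further. When the ratio is above 1.2 with a negative advantage, the unclipped term is kept and the update can still reduce the token's probability.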
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
During the training of a language model, the policy is updated based on a clipped objective function. Consider a single token generation step where the ratio of the current policy's probability to the reference policy's probability for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign