Learn Before
During the training of a language model with PPO, the policy is updated based on a clipped objective function. Consider a single token-generation step where the ratio of the current policy's probability to the old policy's probability (the policy before this update) for a specific token is very large (e.g., 3.0), and the estimated advantage for generating this token is highly positive. The clipping range is set to [0.8, 1.2]. How does the clipping mechanism influence the calculation of the objective for this specific token?
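A minimal worked sketch of the arithmetic, assuming the standard PPO clipped surrogate L = min(r·A, clip(r, 0.8, 1.2)·A); the function name and the sample advantage value (2.0) are illustrative, not from the source:

```python
def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_low: float = 0.8, clip_high: float = 1.2) -> float:
    """PPO clipped surrogate for a single token:
    min(ratio * A, clip(ratio, clip_low, clip_high) * A)."""
    clipped_ratio = max(clip_low, min(ratio, clip_high))
    return min(ratio * advantage, clipped_ratio * advantage)

# Scenario from the question: ratio = 3.0, advantage strongly positive.
# clip(3.0, 0.8, 1.2) = 1.2, so the objective is capped at 1.2 * A
# rather than 3.0 * A; since the clipped term is constant in the ratio,
# it contributes no gradient, preventing an oversized policy update.
print(ppo_clipped_objective(3.0, 2.0))  # 2.4, not 6.0
```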
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
PPO Objective Formula for LLM Training in RLHF
Overall PPO Objective Function for Language Models
Policy Update Analysis with Negative Advantage
Analysis of Clipping Mechanism based on Advantage Sign