Learn Before
A language model is being trained using a reinforcement learning objective. For each generated token, part of this objective is calculated as: Clip(probability_ratio) * Advantage. The probability_ratio is the likelihood of generating the token under the new policy divided by the likelihood under the old policy, and Advantage is an estimate of how much better that token was than the expected average. In a particular training step for a token y, the Advantage is strongly positive, and the probability_ratio is already high (e.g., 1.5, where the clipping threshold is 1.2). How does the Clip function influence the update to the model's policy for generating token y?
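The clipping behavior described in the question can be sketched in a few lines of Python. This is a minimal illustration of the PPO-style clipped surrogate for a single token; the function name and the ε = 0.2 threshold (giving the 1.2 upper bound from the question) are illustrative choices, not a full training implementation:

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token (sketch).

    ratio: new-policy probability / old-policy probability
    advantage: estimated advantage of the token
    eps: clipping half-width, so the ratio is clamped to [1-eps, 1+eps]
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min of the unclipped and clipped terms caps the objective.
    return min(ratio * advantage, clipped_ratio * advantage)

# Token y from the question: positive advantage, ratio already above 1+eps.
print(clipped_objective(1.5, 2.0))   # capped at (1+eps)*A = 2.4, not 1.5*A = 3.0
# The objective is flat with respect to the ratio beyond 1+eps, so the
# gradient incentive to push the probability even higher is zero:
print(clipped_objective(1.51, 2.0) - clipped_objective(1.5, 2.0))  # 0.0
```

Because the ratio (1.5) already exceeds the clipping threshold (1.2) and the advantage is positive, the clipped term is the smaller one and it is constant in the ratio, so the update provides no further push to increase the probability of y. The clip limits how far a single step can move the policy.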
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing LLM Training Instability
A team is fine-tuning a large language model using a reinforcement learning objective that includes a clipped probability ratio multiplied by an advantage estimate, and a penalty term based on the divergence from a reference model. During training, they observe that while the model's average reward is increasing, its outputs are becoming nonsensical and repetitive, losing the general language capabilities of the original model. Which of the following is the most likely cause of this issue?
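The failure mode in this question hinges on the balance between the reward-driven term and the divergence penalty. A minimal sketch of the combined per-token objective, assuming a penalty coefficient (here called `beta`, an illustrative name) multiplying the KL divergence from the reference model:

```python
def rl_objective(clipped_term, kl_divergence, beta):
    """Sketch of the full objective: clipped surrogate minus a KL penalty
    against the reference model. `beta` scales the penalty; if it is too
    small, the reward term dominates and the policy can drift far from
    the reference model's language distribution (reward hacking)."""
    return clipped_term - beta * kl_divergence

# Same reward term, same (large) divergence from the reference model:
print(rl_objective(2.4, kl_divergence=5.0, beta=0.1))  # weak penalty: objective stays positive
print(rl_objective(2.4, kl_divergence=5.0, beta=1.0))  # strong penalty: drift is penalized
```

With a weak penalty, maximizing reward can still be worthwhile even at large divergence, which matches the symptom described: rising average reward alongside degenerate, repetitive outputs as the model abandons the reference model's general language distribution.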