Learn Before
Diagnosing LLM Training Instability
An AI development team is fine-tuning a large language model using a reinforcement learning approach. They observe that while the model's outputs initially align better with the desired criteria, it soon begins to generate repetitive, low-quality, and nonsensical text. The team's policy optimization objective function contains two key parts: 1) a term that encourages actions with high advantage, clipped to prevent overly large updates, and 2) a penalty term, scaled by a coefficient β, that penalizes divergence of the current policy from a stable, initial reference policy.
Given the observed training behavior, which part of the objective function is likely misconfigured? Specifically, is the β coefficient most likely set too high or too low? Justify your reasoning by explaining the role of the penalty term in the training process.
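For concreteness, a minimal sketch of how such an objective is often assembled is shown below. This is an illustrative PyTorch example, not the team's actual implementation; the function name ppo_kl_objective, the per-token approximation of the divergence penalty, and the default values of clip_eps and beta are assumptions made for this sketch.

```python
# Illustrative sketch: a PPO-style objective with a clipped-ratio term
# and a beta-weighted penalty for drifting from a frozen reference policy.
import torch

def ppo_kl_objective(logp_new, logp_old, logp_ref, advantage,
                     clip_eps=0.2, beta=0.1):
    """Per-token objective to be maximized (all inputs are tensors of the same shape).

    logp_new  : log-probabilities of the sampled tokens under the current policy
    logp_old  : log-probabilities under the policy that generated the rollout
    logp_ref  : log-probabilities under the frozen, initial reference policy
    advantage : advantage estimates for the sampled tokens
    """
    ratio = torch.exp(logp_new - logp_old)              # probability ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate: take the more pessimistic of the two terms.
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    # Divergence penalty, approximated per token on the sampled sequence.
    kl_penalty = logp_new - logp_ref
    return (surrogate - beta * kl_penalty).mean()
```

In this form, β sets the relative weight of the divergence penalty against the advantage-driven term.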
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing LLM Training Instability
A team is fine-tuning a large language model using a reinforcement learning objective that includes a clipped probability ratio multiplied by an advantage estimate, and a penalty term based on the divergence from a reference model. During training, they observe that while the model's average reward is increasing, its outputs are becoming nonsensical and repetitive, losing the general language capabilities of the original model. Which of the following is the most likely cause of this issue?
A language model is being trained using a reinforcement learning objective. For each generated token, part of this objective is calculated as:
Clip(probability_ratio) * Advantage. The probability_ratio is the likelihood of generating the token under the new policy divided by the likelihood under the old policy, and Advantage is an estimate of how much better that token was than the expected average. In a particular training step for a token y, the Advantage is strongly positive, and the probability_ratio is already high (e.g., 1.5, where the clipping threshold is 1.2). How does the Clip function influence the update to the model's policy for generating token y?
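As a small worked sketch of the clipping step described above (the helper name clip and the lower bound 0.8 are assumptions for illustration; the upper bound 1.2 mirrors the threshold in the question):

```python
# Hypothetical helper illustrating the Clip operation on the probability ratio.
def clip(ratio, lower=0.8, upper=1.2):
    """Bound the probability ratio to the interval [lower, upper]."""
    return max(lower, min(ratio, upper))

advantage = 2.0             # strongly positive Advantage for token y
probability_ratio = 1.5     # already above the 1.2 clipping threshold
term = clip(probability_ratio) * advantage
print(term)                 # 1.2 * 2.0 = 2.4 rather than 1.5 * 2.0 = 3.0
```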