Learn Before
Stabilizing Policy Updates with a Divergence Penalty
Incorporating a policy divergence penalty into the optimization objective stabilizes the learning process: the penalty discourages the current policy from straying too far from a reference policy, limiting the excessively large updates that can disrupt training.
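A toy one-dimensional sketch of this idea (all objective shapes, the squared-distance divergence, and the value of `beta` are illustrative assumptions, not from the source): the raw objective prefers a large jump in the parameter, but the penalty pulls the maximizer back toward the reference point.

```python
import numpy as np

theta_old = 0.0                       # reference (previous) parameters
thetas = np.linspace(-2.0, 2.0, 401)  # candidate parameter values

def J(theta):
    # Hypothetical raw objective: peaks at theta = 1.5, far from theta_old.
    return -(theta - 1.5) ** 2

def D(theta, theta_ref):
    # Simple stand-in divergence: squared distance from the reference.
    return (theta - theta_ref) ** 2

beta = 2.0  # penalty weight (assumed value)

best_raw = thetas[np.argmax(J(thetas))]
best_penalized = thetas[np.argmax(J(thetas) - beta * D(thetas, theta_old))]
# best_penalized lies between theta_old and best_raw: the penalty
# shrinks the update without changing its direction.
```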
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Log-Probability Difference as a Policy Divergence Penalty
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function,
J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
Stabilizing Reinforcement Learning Training
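The objective in this question, together with the log-probability-difference penalty named above, can be sketched as follows. The function names, the Monte Carlo KL estimate, and `beta=0.05` are assumptions for illustration, not the engineer's actual implementation.

```python
import numpy as np

def kl_penalty_estimate(logp_current, logp_old):
    # With actions sampled from the current policy, the mean log-probability
    # difference is a Monte Carlo estimate of KL(current || old): it is zero
    # when the policies agree and grows as they diverge.
    return float(np.mean(logp_current - logp_old))

def j_new(j_theta, logp_current, logp_old, beta=0.05):
    # J_new(theta) = J(theta) - beta * D(theta, theta_old).
    # Maximizing J_new trades raw objective value against divergence,
    # so aggressive policy changes are penalized.
    return j_theta - beta * kl_penalty_estimate(logp_current, logp_old)
```

When the current and old policies assign identical log-probabilities, the penalty vanishes and J_new equals J; as the current policy drifts, the penalty reduces J_new, discouraging the update.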
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term will shrink the trusted area, restricting the policy to smaller updates.
Learn After
An engineer is fine-tuning a language model and observes that the training process is highly unstable. The model's performance fluctuates wildly, and the training loss sometimes spikes dramatically, suggesting the policy updates are too aggressive. Which of the following modifications to the optimization objective is most specifically designed to counteract this problem by directly constraining the magnitude of policy changes at each step?
Stabilizing an Erratic Training Process
Analyzing the Impact of a Policy Divergence Penalty