Learn Before
Log-Probability Difference as a Policy Divergence Penalty
A simple way to implement a penalty for policy divergence in trust region optimization is to calculate the difference between the log-probabilities of a trajectory, τ. This penalty measures the change between the current policy, π_θ, and a reference policy, π_ref, using the following formula:

log π_θ(τ) - log π_ref(τ)
This value provides a straightforward measure of how much the policy has deviated.
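As a concrete illustration, here is a minimal Python sketch of this penalty. It assumes the trajectory's log-probability under each policy is available as a list of per-step log-probabilities; the function name and the numbers are hypothetical.

```python
import math

def logprob_divergence_penalty(logp_current, logp_reference):
    """Per-trajectory penalty: log pi_theta(tau) - log pi_ref(tau).

    Each argument is a list of per-step log-probabilities the policy
    assigns along the same trajectory tau; the trajectory's total
    log-probability is their sum.
    """
    return sum(logp_current) - sum(logp_reference)

# Hypothetical numbers: the current policy assigns higher probability
# to every step of tau than the reference policy does.
current = [math.log(0.5), math.log(0.4)]
reference = [math.log(0.4), math.log(0.3)]
print(logprob_divergence_penalty(current, reference))  # positive: policy shifted toward tau
```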

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Log-Probability Difference as a Policy Divergence Penalty
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
Stabilizing Reinforcement Learning Training
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term will shrink the trusted area, restricting the policy to smaller updates (see the sketch below).
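A minimal scalar sketch of the penalized objective from the question above, with hypothetical numbers: it shows both why maximizing J_new(θ) discourages large updates and why a larger β shrinks the trusted region.

```python
def penalized_objective(j_value, divergence, beta):
    """J_new(theta) = J(theta) - beta * D(theta, theta_old), as scalars.

    j_value stands in for the original objective J(theta), divergence
    for D(theta, theta_old), and beta > 0 weights the penalty.
    """
    return j_value - beta * divergence

# Hypothetical candidates: an aggressive update with a higher raw J but a
# large divergence, and a conservative update that stays near theta_old.
print(penalized_objective(j_value=1.2, divergence=0.8, beta=2.0))  # 1.2 - 1.6 = -0.4
print(penalized_objective(j_value=1.0, divergence=0.1, beta=2.0))  # 1.0 - 0.2 =  0.8
# Maximizing J_new prefers the conservative update: the penalty term
# discourages large, destabilizing jumps away from theta_old.

# For a fixed raw improvement in J, the penalty cancels the gain once the
# divergence exceeds improvement / beta, so raising beta shrinks the
# region of updates that still look worthwhile.
improvement = 0.5
for beta in (0.5, 1.0, 2.0, 4.0):
    print(beta, improvement / beta)  # tolerated divergence falls as beta grows
```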
Learn After
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, π_θ, and a reference policy, π_ref. The penalty is calculated for a specific sequence of actions and states (a trajectory, τ) using the formula:

log π_θ(τ) - log π_ref(τ)

If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation? (A numeric sketch of this case follows the list below.)
Calculating Policy Divergence Penalty
Interpreting Policy Divergence
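As a worked illustration of the interpretation question above, here is a small numeric sketch with hypothetical probabilities: a large positive penalty means the current policy now assigns the trajectory far more probability than the reference policy does.

```python
import math

# Hypothetical trajectory probabilities under the two policies.
p_current = 0.6     # pi_theta(tau)
p_reference = 0.01  # pi_ref(tau)

penalty = math.log(p_current) - math.log(p_reference)
print(penalty)  # ~4.09: the current policy has drifted strongly toward tau
```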