Learn Before
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
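As a rough, self-contained sketch (not part of the original question), assuming NumPy and a mean-squared parameter difference as a stand-in for D, the snippet below shows how the subtracted term grows with the divergence from θ_old and therefore lowers the value being maximized; the names divergence, penalized_objective, and J are hypothetical, introduced only for illustration.

```python
import numpy as np

def divergence(theta, theta_old):
    # Illustrative stand-in for D: mean squared difference between parameter vectors.
    return float(np.mean((theta - theta_old) ** 2))

def penalized_objective(J, theta, theta_old, beta=1.0):
    # J_new(theta) = J(theta) - beta * D(theta, theta_old).
    # The further theta drifts from theta_old, the more is subtracted from the
    # value being maximized, so large jumps away from the previous policy are
    # discouraged.
    return J(theta) - beta * divergence(theta, theta_old)

# Toy objective: watch the subtracted penalty grow with divergence.
J = lambda theta: float(np.sum(theta))
theta_old = np.zeros(3)
for theta in (np.array([0.1, 0.1, 0.1]), np.array([2.0, 2.0, 2.0])):
    raw = J(theta)
    penalized = penalized_objective(J, theta, theta_old, beta=1.0)
    print(f"raw J = {raw:.2f}, penalty = {raw - penalized:.2f}, J_new = {penalized:.2f}")
```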
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Log-Probability Difference as a Policy Divergence Penalty
Stabilizing Reinforcement Learning Training
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term shrinks the trusted area, constraining the policy to smaller updates.
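A small worked example (added here, not from the original card) makes the relationship concrete: with a one-dimensional J(θ) = θ and a quadratic divergence D(θ, θ_old) = (θ - θ_old)², the penalized optimum moves by only 1/(2β), so a larger β yields a smaller step.

```latex
J_{\text{new}}(\theta) = \theta - \beta\,(\theta - \theta_{\text{old}})^2,
\qquad
\frac{dJ_{\text{new}}}{d\theta} = 1 - 2\beta\,(\theta - \theta_{\text{old}}) = 0
\;\;\Rightarrow\;\;
\theta^{\star} = \theta_{\text{old}} + \frac{1}{2\beta}.
```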