Interpreting Policy Divergence
In a reinforcement learning optimization step, you are evaluating a potential update to your policy. You analyze a specific trajectory of actions and states, τ, to decide if the update is within an acceptable range. Using the log-probability difference as a penalty, you gather the following data. Based on this data, calculate the penalty and determine whether the optimization process would be encouraged or discouraged from making this policy update. Justify your reasoning.
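The calculation the prompt asks for can be sketched as follows. This is a minimal illustration, not the card's own solution: the trajectory data is not reproduced here, so the two log-probabilities below are hypothetical placeholder values, and the helper names `divergence_penalty` and `interpret` are assumptions introduced for the example. It also assumes the common setup in which the penalty is subtracted from the objective being maximized.

```python
def divergence_penalty(logp_current: float, logp_ref: float) -> float:
    """Log-probability difference between the current and reference policy
    for one trajectory: log pi_current(tau) - log pi_ref(tau)."""
    return logp_current - logp_ref


def interpret(penalty: float, tol: float = 1e-9) -> str:
    """Assumes the penalty is subtracted from the objective being maximized,
    so a positive value lowers the objective and a negative value raises it."""
    if penalty > tol:
        return "discouraged"
    if penalty < -tol:
        return "encouraged"
    return "neutral"


# Hypothetical values, NOT taken from the exercise data.
logp_current = -12.0  # log-probability of tau under the updated policy
logp_ref = -15.0      # log-probability of tau under the reference policy

penalty = divergence_penalty(logp_current, logp_ref)
print(penalty, interpret(penalty))
```

With these placeholder numbers the updated policy assigns the trajectory a higher log-probability than the reference policy does, so the penalty is positive and the penalty term pushes against the update. How strongly it does so depends on the penalty coefficient and the reward, neither of which is given here.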
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
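In a policy optimization process, a penalty is used to measure the change between a current policy, $\pi_{\theta}$, and a reference policy, $\pi_{\theta_{\text{ref}}}$. The penalty is calculated for a specific sequence of actions and states (a trajectory, $\tau$) using the formula: $\text{Penalty} = \log \pi_{\theta}(\tau) - \log \pi_{\theta_{\text{ref}}}(\tau)$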
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
Calculating Policy Divergence Penalty