In a policy optimization process, a penalty is used to measure the change between a current policy, , and a reference policy, . The penalty is calculated for a specific sequence of actions and states (a trajectory, ) using the formula:
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, , and a reference policy, . The penalty is calculated for a specific sequence of actions and states (a trajectory, ) using the formula:
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
Calculating Policy Divergence Penalty
Interpreting Policy Divergence