Calculating Policy Divergence Penalty
An optimization algorithm is updating a policy. For a specific trajectory, τ, the log-probability under the current policy, log π_θ(τ), is -2.5. The log-probability under the reference policy, log π_θ_ref(τ), is -4.0. Calculate the penalty used to measure the divergence between these two policies based on the difference in their log-probabilities.
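The penalty here is simply the difference of the two log-probabilities, log π_θ(τ) − log π_θ_ref(τ). A minimal sketch of the arithmetic (variable names are illustrative, not from the question):

```python
# Per-trajectory divergence penalty as a log-probability difference:
#   penalty(tau) = log pi_theta(tau) - log pi_theta_ref(tau)

log_p_current = -2.5    # log pi_theta(tau), current policy
log_p_reference = -4.0  # log pi_theta_ref(tau), reference policy

penalty = log_p_current - log_p_reference
print(penalty)  # -2.5 - (-4.0) = 1.5
```

A positive value indicates the current policy assigns higher probability to this trajectory than the reference policy does.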
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, π_θ, and a reference policy, π_θ_ref. The penalty is calculated for a specific sequence of actions and states (a trajectory, τ) using the formula: penalty(τ) = log π_θ(τ) − log π_θ_ref(τ).
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
Calculating Policy Divergence Penalty
Interpreting Policy Divergence