Learn Before
Consider the approximated policy divergence penalty formula: Penalty = Σ [log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t)]. This penalty's value for a fixed trajectory of states and actions is sensitive to changes in the environment's transition dynamics.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An agent executes an identical sequence of states and actions in two different environments, A and B. The agent's policy (π_θ) and a reference policy (π_θ_ref) are also the same in both scenarios. When calculating the approximated policy divergence penalty using the formula
Penalty = Σ [log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t)], the result is identical for both environments. What is the fundamental reason for this?
Calculating Approximated Policy Divergence
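The scenario above can be sketched in a few lines of Python. The log-probability values below are purely illustrative placeholders; the point is that the penalty is a function only of the two policies' log-probabilities on the fixed trajectory, so environments A and B (whatever their transition dynamics) yield the same number.

```python
# Hypothetical log-probabilities assigned by the current policy pi_theta and
# the frozen reference policy pi_theta_ref to each action a_t taken in state
# s_t along one fixed trajectory. The values are illustrative only.
logp_current = [-0.5, -1.2, -0.3]
logp_reference = [-0.7, -1.0, -0.4]

def divergence_penalty(logp_cur, logp_ref):
    """Penalty = sum_t [log pi_theta(a_t|s_t) - log pi_theta_ref(a_t|s_t)]."""
    return sum(c - r for c, r in zip(logp_cur, logp_ref))

# Environments A and B may differ in transition dynamics, but for the SAME
# trajectory and the SAME policies the penalty's inputs are identical, so
# the penalty itself is identical -- dynamics never enter the formula.
penalty_env_a = divergence_penalty(logp_current, logp_reference)
penalty_env_b = divergence_penalty(logp_current, logp_reference)
assert penalty_env_a == penalty_env_b
```

This makes the answer to the question concrete: the transition probabilities of the environment appear nowhere in the penalty, only the policies' conditional log-probabilities over the given state-action pairs.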