Multiple Choice

An agent executes an identical sequence of states and actions in two different environments, A and B. The agent's policy π_θ and a reference policy π_ref are also the same in both scenarios. When the approximated policy divergence penalty is computed as Penalty = Σ_t [log π_θ(a_t|s_t) − log π_ref(a_t|s_t)], the result is identical for both environments. What is the fundamental reason for this?
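The point the question tests can be made concrete in code: the penalty is a function only of the visited (state, action) pairs and the two policies' log-probabilities, so the environment never enters the computation. Below is a minimal sketch with hypothetical log-probability tables (the names `log_pi_theta`, `log_pi_ref`, and the example states/actions are illustrative, not from the source):

```python
import math

# Hypothetical log-probability tables for the current policy pi_theta and the
# reference policy pi_ref, indexed by (state, action). Any function of (s, a)
# would do; note that the environment appears nowhere in the computation.
log_pi_theta = {("s0", "a0"): math.log(0.6), ("s1", "a1"): math.log(0.3)}
log_pi_ref   = {("s0", "a0"): math.log(0.5), ("s1", "a1"): math.log(0.4)}

def divergence_penalty(trajectory):
    """Sum of log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t) over the trajectory."""
    return sum(log_pi_theta[(s, a)] - log_pi_ref[(s, a)] for s, a in trajectory)

# The same (state, action) sequence is observed in environment A and in B:
trajectory = [("s0", "a0"), ("s1", "a1")]

penalty_env_A = divergence_penalty(trajectory)
penalty_env_B = divergence_penalty(trajectory)  # the environment plays no role
assert penalty_env_A == penalty_env_B
```

Because `divergence_penalty` takes only the trajectory and the policies as inputs, identical trajectories and identical policies force identical penalties, regardless of the environments' dynamics or rewards.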

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science