1Cademy - Approximated Policy Divergence Penalty Formula

Learn Before

Approximation of the Policy Divergence Penalty

Formula

Approximated Policy Divergence Penalty Formula

In practice, the policy divergence penalty is approximated by summing the differences in log-probabilities for each state-action pair along a trajectory, rather than using the log-probability of the entire trajectory. This simplification leaves out the environment's transition dynamics, so the penalty depends only on the action probabilities assigned by the current policy $\pi_{\theta}$ and the reference policy $\pi_{\theta_{\text{ref}}}$. The formula is: $\text{Penalty} = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) - \sum_{t=1}^{T} \log \pi_{\theta_{\text{ref}}}(a_t|s_t)$
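
As a rough illustration of the formula, the sketch below computes the penalty for a single short trajectory. The function name `approx_divergence_penalty` and the per-step probabilities are hypothetical values chosen for this example, not taken from the course material. It assumes the per-step log-probabilities under both policies have already been evaluated on the same state-action pairs.

```python
import math


def approx_divergence_penalty(logp_policy, logp_ref):
    """Return sum_t log pi_theta(a_t|s_t) - sum_t log pi_ref(a_t|s_t).

    Both arguments are lists of per-step log-probabilities evaluated on
    the same trajectory (hypothetical helper, for illustration only).
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("both policies must be scored on the same T steps")
    return sum(logp_policy) - sum(logp_ref)


# Hypothetical action probabilities for a 3-step trajectory (T = 3).
policy_probs = [0.5, 0.8, 0.9]  # pi_theta(a_t | s_t)
ref_probs = [0.4, 0.5, 0.9]     # pi_theta_ref(a_t | s_t)

logp_policy = [math.log(p) for p in policy_probs]
logp_ref = [math.log(p) for p in ref_probs]

penalty = approx_divergence_penalty(logp_policy, logp_ref)
print(penalty)  # close to log(2), about 0.693
```

Only the two policies' action probabilities enter the computation; no transition probabilities are passed in, which mirrors the simplification described above.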

Updated 2025-10-08

References

Reference of Foundations of Large Language Models Course

Learn After

An agent executes an identical sequence of states and actions in two different environments, A and B. The agent's policy ($\pi_{\theta}$) and a reference policy ($\pi_{\theta_{\text{ref}}}$) are also the same in both scenarios. When calculating the approximated policy divergence penalty using the formula $\text{Penalty} = \sum_{t=1}^{T} \left[ \log \pi_{\theta}(a_t|s_t) - \log \pi_{\theta_{\text{ref}}}(a_t|s_t) \right]$, the result is identical for both environments. What is the fundamental reason for this?
Calculating Approximated Policy Divergence
Consider the approximated policy divergence penalty formula: $\text{Penalty} = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) - \sum_{t=1}^{T} \log \pi_{\theta_{\text{ref}}}(a_t|s_t)$. This penalty's value for a fixed trajectory of states and actions is sensitive to changes in the environment's transition dynamics.
