Calculating Policy Divergence Penalty
An optimization algorithm is updating a policy. For a specific trajectory, τ, the log-probability under the current policy, log π_θ(τ), is -2.5. The log-probability under the reference policy, log π_θ_ref(τ), is -4.0. Calculate the penalty used to measure the divergence between these two policies based on the difference in their log-probabilities.
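The penalty here is simply the difference of the two log-probabilities, log π_θ(τ) − log π_θ_ref(τ). A minimal sketch of the arithmetic (variable names are illustrative, not from the question):

```python
# Per-trajectory divergence penalty as a log-probability difference:
#   penalty(tau) = log pi_theta(tau) - log pi_theta_ref(tau)

log_p_current = -2.5    # log pi_theta(tau), current policy
log_p_reference = -4.0  # log pi_theta_ref(tau), reference policy

penalty = log_p_current - log_p_reference
print(penalty)  # -2.5 - (-4.0) = 1.5
```

A positive value indicates the current policy assigns higher probability to this trajectory than the reference policy does.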
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, π_θ, and a reference policy, π_θ_ref. The penalty is calculated for a specific sequence of actions and states (a trajectory, τ) using the formula: penalty(τ) = log π_θ(τ) − log π_θ_ref(τ).
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
Calculating Policy Divergence Penalty
Interpreting Policy Divergence