Multiple Choice

In a policy optimization process, a penalty is used to measure the change between a current policy, $\pi_{\theta}$, and a reference policy, $\pi_{\theta_{\text{ref}}}$. The penalty is calculated for a specific sequence of actions and states (a trajectory, $\tau$) using the formula:

$$\text{Penalty} = \log \pi_{\theta}(\tau) - \log \pi_{\theta_{\text{ref}}}(\tau)$$

If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
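To see why a large positive penalty means the current policy has drifted toward assigning the trajectory much higher probability than the reference policy, the formula can be sketched numerically. This is a toy illustration with made-up per-step probabilities, assuming $\log \pi(\tau)$ factorizes into a sum of per-step log-probabilities:

```python
import math

# Hypothetical per-step probabilities each policy assigns to the same
# actions along trajectory tau (illustrative numbers, not from the question).
current_probs = [0.9, 0.8, 0.7]    # pi_theta(a_t | s_t)
reference_probs = [0.3, 0.4, 0.5]  # pi_theta_ref(a_t | s_t)

# log pi(tau) is the sum of per-step log-probabilities.
log_p_current = sum(math.log(p) for p in current_probs)
log_p_reference = sum(math.log(p) for p in reference_probs)

# Penalty = log pi_theta(tau) - log pi_theta_ref(tau)
penalty = log_p_current - log_p_reference
print(penalty)  # positive here: pi_theta makes tau far more likely than the reference
```

Because the penalty is the log of the probability ratio $\pi_{\theta}(\tau) / \pi_{\theta_{\text{ref}}}(\tau)$, a large positive value means the current policy assigns the trajectory a much higher probability than the reference policy does.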


Updated 2025-10-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science