Formula

Policy Gradient Objective with Importance Sampling

To improve the stability and reliability of policy gradient methods, importance sampling is frequently employed to refine the estimate of the utility function $U(\tau; \theta)$. By incorporating a previous or reference policy $\pi_{\theta_{\mathrm{ref}}}$, the refined utility for a trajectory $\tau$ is

$$U(\tau; \theta) = \sum_{t=1}^{T} \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t \mid s_t)} \, A(s_t, a_t).$$

The ratio $\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{ref}}}(a_t \mid s_t)}$ serves as an importance sampling weight: it rescales the advantage $A(s_t, a_t)$ to correct for the difference between the policy being optimized and the reference policy under which the trajectories were collected.
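The following is a minimal PyTorch sketch of this objective, not an implementation from the course. It assumes the per-step log-probabilities under both policies and the advantage estimates are already available as tensors; the function name `importance_sampled_objective` is hypothetical.

```python
import torch

def importance_sampled_objective(logp_current, logp_ref, advantages):
    """Per-trajectory utility U(tau; theta) = sum_t ratio_t * A_t.

    logp_current: log pi_theta(a_t | s_t) for each step t, shape (T,)
    logp_ref:     log pi_theta_ref(a_t | s_t), shape (T,); detached
                  because the reference policy is not being optimized
    advantages:   advantage estimates A(s_t, a_t), shape (T,)
    """
    # Importance weight pi_theta / pi_theta_ref, computed in log space
    # for numerical stability before exponentiating.
    ratio = torch.exp(logp_current - logp_ref.detach())
    # Scale each advantage by its importance weight and sum over the
    # trajectory. This is an objective to maximize, so the training
    # loss would be its negation.
    return (ratio * advantages).sum()

# Usage on dummy data: gradients flow only through logp_current.
T = 8
logp_current = torch.randn(T, requires_grad=True)
logp_ref = torch.randn(T)
advantages = torch.randn(T)
U = importance_sampled_objective(logp_current, logp_ref, advantages)
U.backward()
```

Note that when $\pi_{\theta} = \pi_{\theta_{\mathrm{ref}}}$, every ratio equals 1 and the expression reduces to the ordinary advantage-weighted objective.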


