Equivalence of Surrogate and On-Policy Gradients at the Reference Point
A key property of the importance-sampled surrogate objective $J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$ is that its gradient, when evaluated at the reference policy parameters $\theta = \theta_{\text{ref}}$, is identical to the standard on-policy policy gradient. This follows from moving the derivative inside the expectation, which is valid because the sampling distribution $\pi_{\theta_{\text{ref}}}$ does not depend on $\theta$: $\nabla_\theta J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$. The term inside the expectation, $\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}$, when evaluated at $\theta = \theta_{\text{ref}}$, equals $\nabla_\theta \log \pi_\theta(\tau)\big|_{\theta = \theta_{\text{ref}}}$, because $\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \pi_\theta(\tau) / \pi_\theta(\tau)$. The right-hand side is therefore $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$ evaluated at $\theta_{\text{ref}}$, which is precisely the on-policy policy gradient at the reference policy. This equivalence ensures that, at the beginning of each optimization step, the update direction given by the surrogate objective is the same as the true policy gradient.
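This property can be checked numerically. The sketch below is a minimal illustration, not part of the original text: it assumes a one-step bandit with three actions, a softmax policy over logits, fixed per-action rewards, and PyTorch for automatic differentiation; the names surrogate_grad and reinforce_grad are hypothetical. On a fixed batch of samples drawn from the reference policy, the Monte Carlo gradient of the importance-sampled surrogate coincides with the score-function (REINFORCE) gradient at $\theta = \theta_{\text{ref}}$, and generally differs after a parameter update.

```python
# Minimal sketch (assumed setup): 3-action bandit, softmax policy, PyTorch autograd.
import torch

torch.manual_seed(0)
R = torch.tensor([1.0, -2.0, 0.5])           # reward of each action (illustrative)
theta_ref = torch.tensor([0.2, -0.4, 0.1])   # reference policy parameters

# Sample a batch of one-step "trajectories" (single actions) from the reference policy.
pi_ref = torch.softmax(theta_ref, dim=0)
actions = torch.multinomial(pi_ref, num_samples=1000, replacement=True)
rewards = R[actions]

def surrogate_grad(theta):
    """Monte Carlo gradient of the surrogate E_ref[(pi_theta / pi_ref) * R]."""
    theta = theta.clone().requires_grad_(True)
    pi = torch.softmax(theta, dim=0)
    ratio = pi[actions] / pi_ref[actions]     # pi_ref is a constant w.r.t. theta
    (ratio * rewards).mean().backward()
    return theta.grad

def reinforce_grad(theta):
    """Score-function (REINFORCE) gradient estimate on the same sampled batch."""
    theta = theta.clone().requires_grad_(True)
    log_pi = torch.log_softmax(theta, dim=0)
    (log_pi[actions] * rewards).mean().backward()
    return theta.grad

# At theta = theta_ref the two gradient estimates coincide (up to float error).
print(torch.allclose(surrogate_grad(theta_ref), reinforce_grad(theta_ref), atol=1e-6))

# After a single gradient step they generally no longer match.
theta_new = theta_ref + 0.5 * surrogate_grad(theta_ref)
print(torch.allclose(surrogate_grad(theta_new), reinforce_grad(theta_new), atol=1e-6))
```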

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by $\pi_\theta$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\text{ref}}}$. To improve the current policy using this existing data, the following objective function is optimized: $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
In a reinforcement learning scenario, an agent is in a particular state and has two possible actions, Action A and Action B. The agent's current parameterized policy assigns a non-zero probability to both actions. After sampling several trajectories, the agent estimates that the expected cumulative reward for taking Action A from this state is +10, while the expected cumulative reward for taking Action B from this state is -5. Based on the fundamental principle of updating a policy to maximize expected returns, how will the gradient update affect the probabilities of these actions?
Diagnosing Learning Issues in Policy Gradients
An agent's learning process involves updating its decision-making parameters (θ) based on experience. The update rule is proportional to the expression: Σ_s ρ(s) Σ_a ∇_θ π(s,a) Q(s,a). Match each mathematical component from this expression to its conceptual role in guiding the learning update.
Learn After
In policy optimization, an importance-sampled surrogate objective is often used to approximate the true on-policy objective. A key mathematical property of this surrogate is that its gradient, when evaluated at the reference policy (i.e., the policy used to collect the data), is identical to the true on-policy policy gradient. What is the most significant implication of this property for the training process?
In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. However, after applying a single gradient update to the policy parameters, the two gradients are no longer identical. This divergence indicates a flaw in the algorithm's implementation.
In policy optimization, an objective function is often constructed using data from a fixed, older policy (the 'reference policy') to estimate the performance of a new policy being optimized. This objective uses an importance sampling ratio: $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]$, where $\pi_\theta$ is the new policy and $\pi_{\theta_{\text{ref}}}$ is the reference policy. A critical property of this objective is that its gradient, when evaluated at the point where the new policy is identical to the reference policy, is exactly equal to the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?