Formula

Equivalence of the Surrogate Objective and the On-Policy Objective

The surrogate objective function, which evaluates a policy $\pi_\theta$ using trajectories sampled from a reference policy $\pi_{\theta_{\mathrm{ref}}}$, is mathematically equivalent to the true on-policy objective. The equivalence follows from expanding the importance-sampled expectation into a sum over trajectories, where the probability of each trajectory under the reference policy cancels out:

$$\mathbb{E}_{\tau \sim \pi_{\theta_{\mathrm{ref}}}}\!\left[ \frac{\mathrm{Pr}_{\theta}(\tau)}{\mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau) \right] = \sum_{\tau} \mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)\, \frac{\mathrm{Pr}_{\theta}(\tau)}{\mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau) = \sum_{\tau} \mathrm{Pr}_{\theta}(\tau)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta}}\big[ R(\tau) \big].$$

However, this strict equivalence holds only when the expectation is taken over the entire trajectory space. In practice, policy-learning methods estimate the objective from a relatively small sample of sequences, so the sampling methodology itself significantly influences the resulting estimates.
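The cancellation and its finite-sample caveat can be checked numerically. The following is a minimal sketch, not taken from the source: it assumes a toy discrete trajectory space with hypothetical reward and probability values, represents $\mathrm{Pr}_{\theta}$ and $\mathrm{Pr}_{\theta_{\mathrm{ref}}}$ directly as arrays, and compares the exact full-space surrogate objective with a Monte Carlo estimate drawn from the reference policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a discrete "trajectory" space of 5 trajectories,
# each with a scalar reward R(tau). Values are illustrative only.
rewards = np.array([1.0, 0.5, -0.2, 2.0, 0.0])

# Two policies, represented directly as trajectory distributions:
# pr_theta is the policy being evaluated, pr_ref is the reference
# (behavior) policy that generates the samples.
pr_theta = np.array([0.10, 0.25, 0.30, 0.15, 0.20])
pr_ref   = np.array([0.30, 0.20, 0.20, 0.10, 0.20])

# Exact on-policy objective: E_{tau ~ pi_theta}[R(tau)].
on_policy = np.sum(pr_theta * rewards)

# Exact surrogate objective: E_{tau ~ pi_ref}[(Pr_theta/Pr_ref) R(tau)].
# Summed over the entire trajectory space, the Pr_ref factors cancel,
# so this matches the on-policy value exactly.
surrogate_exact = np.sum(pr_ref * (pr_theta / pr_ref) * rewards)
assert np.isclose(on_policy, surrogate_exact)

# Monte Carlo estimate from a finite sample of trajectories drawn from
# the reference policy: unbiased, but its accuracy depends on how well
# pi_ref covers the trajectories that matter under pi_theta.
n = 100
idx = rng.choice(len(rewards), size=n, p=pr_ref)
weights = pr_theta[idx] / pr_ref[idx]
surrogate_mc = np.mean(weights * rewards[idx])

print(f"on-policy objective       : {on_policy:.4f}")
print(f"surrogate (full sum)      : {surrogate_exact:.4f}")
print(f"surrogate ({n} samples)   : {surrogate_mc:.4f}")
```

The exact surrogate matches the on-policy objective to machine precision, while the sampled estimate fluctuates around it; with a reference policy that poorly covers high-reward trajectories, the importance weights become large and the estimate degrades, which is the practical limitation noted above.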
