Formula

Equivalence of Surrogate and On-Policy Gradients at the Reference Point

A key property of the importance-sampled surrogate objective is that its gradient, evaluated at the reference policy parameters $\theta = \theta_{\text{ref}}$, is identical to the standard on-policy policy gradient. This follows from moving the derivative inside the expectation, which is valid because the sampling distribution $\pi_{\theta_{\text{ref}}}$ does not depend on $\theta$:

$$
\left. \frac{\partial}{\partial \theta} \, \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}(\tau)} \, R(\tau) \right] \right|_{\theta = \theta_{\text{ref}}} = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\left. \frac{\partial}{\partial \theta} \text{Pr}_{\theta}(\tau) \right|_{\theta = \theta_{\text{ref}}}}{\text{Pr}_{\theta_{\text{ref}}}(\tau)} \, R(\tau) \right]
$$

The term inside the expectation on the right, $\frac{\nabla_{\theta} \text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}(\tau)}$, evaluated at $\theta = \theta_{\text{ref}}$, equals $\nabla_{\theta} \log \text{Pr}_{\theta}(\tau)\big|_{\theta = \theta_{\text{ref}}}$: by the log-derivative identity, $\nabla_{\theta} \log \text{Pr}_{\theta}(\tau) = \frac{\nabla_{\theta} \text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta}(\tau)}$, and at the reference point the denominator $\text{Pr}_{\theta}(\tau)$ coincides with $\text{Pr}_{\theta_{\text{ref}}}(\tau)$. The right-hand side is therefore exactly the on-policy policy gradient $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}\left[ \nabla_{\theta} \log \text{Pr}_{\theta}(\tau)\big|_{\theta = \theta_{\text{ref}}} \, R(\tau) \right]$. This equivalence ensures that at the beginning of an optimization step, the update direction given by the surrogate objective matches the true policy gradient.
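To make the equivalence concrete, here is a minimal numerical check in PyTorch, not from the original source: a toy single-step setting where the policy is a softmax over four actions, a "trajectory" is one sampled action, and $R(\tau)$ is a fixed per-action reward. All names (`theta_ref`, `rewards`, and so on) are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy single-step setting (illustrative assumption): the policy is a softmax
# over 4 actions, a "trajectory" is one action, R(tau) is a per-action reward.
theta_ref = torch.randn(4, dtype=torch.float64)
rewards = torch.tensor([1.0, -0.5, 2.0, 0.3], dtype=torch.float64)

# Sample trajectories from the reference policy pi_{theta_ref}.
probs_ref = torch.softmax(theta_ref, dim=0)
actions = torch.multinomial(probs_ref, num_samples=100_000, replacement=True)

# Gradient of the importance-sampled surrogate at theta = theta_ref:
# E[ (Pr_theta(tau) / Pr_theta_ref(tau)) * R(tau) ].
theta = theta_ref.clone().requires_grad_(True)
ratio = torch.softmax(theta, dim=0)[actions] / probs_ref[actions]
surrogate = (ratio * rewards[actions]).mean()
(grad_surrogate,) = torch.autograd.grad(surrogate, theta)

# On-policy policy gradient at the same point:
# E[ grad_theta log Pr_theta(tau) * R(tau) ].
theta = theta_ref.clone().requires_grad_(True)
log_probs = torch.log_softmax(theta, dim=0)[actions]
on_policy = (log_probs * rewards[actions]).mean()
(grad_on_policy,) = torch.autograd.grad(on_policy, theta)

# The two gradients agree up to floating-point error, as derived above.
print(grad_surrogate)
print(grad_on_policy)
assert torch.allclose(grad_surrogate, grad_on_policy, atol=1e-10)
```

Note that the agreement here is sample-by-sample, not merely in expectation: for each sampled trajectory, the ratio's denominator $\text{Pr}_{\theta_{\text{ref}}}(\tau)$ equals $\text{Pr}_{\theta}(\tau)$ at $\theta = \theta_{\text{ref}}$, so the per-sample surrogate gradient is exactly the per-sample score-function gradient.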
