Formula

Equivalence of the Surrogate Objective and the On-Policy Objective

The surrogate objective function, which evaluates a policy $\pi_\theta$ using trajectories sampled from a reference policy $\pi_{\theta_{\mathrm{ref}}}$, is mathematically equivalent to the true on-policy objective. The equivalence follows from expanding the importance-sampled expectation into a sum over trajectories, where the probability of each trajectory under the reference policy cancels out:

$$\mathbb{E}_{\tau \sim \pi_{\theta_{\mathrm{ref}}}}\!\left[ \frac{\mathrm{Pr}_{\theta}(\tau)}{\mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau) \right] = \sum_{\tau} \mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)\, \frac{\mathrm{Pr}_{\theta}(\tau)}{\mathrm{Pr}_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau) = \sum_{\tau} \mathrm{Pr}_{\theta}(\tau)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta}}\big[ R(\tau) \big].$$

However, this strict equivalence holds only when the expectation is taken over the entire trajectory space. In practice, policy-learning methods estimate the objective from a relatively small sample of sequences, so the sampling methodology itself significantly influences the resulting estimates.
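The cancellation and its finite-sample caveat can be checked numerically. The following is a minimal sketch, not taken from the source: it assumes a toy discrete trajectory space with hypothetical reward and probability values, represents $\mathrm{Pr}_{\theta}$ and $\mathrm{Pr}_{\theta_{\mathrm{ref}}}$ directly as arrays, and compares the exact full-space surrogate objective with a Monte Carlo estimate drawn from the reference policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a discrete "trajectory" space of 5 trajectories,
# each with a scalar reward R(tau). Values are illustrative only.
rewards = np.array([1.0, 0.5, -0.2, 2.0, 0.0])

# Two policies, represented directly as trajectory distributions:
# pr_theta is the policy being evaluated, pr_ref is the reference
# (behavior) policy that generates the samples.
pr_theta = np.array([0.10, 0.25, 0.30, 0.15, 0.20])
pr_ref   = np.array([0.30, 0.20, 0.20, 0.10, 0.20])

# Exact on-policy objective: E_{tau ~ pi_theta}[R(tau)].
on_policy = np.sum(pr_theta * rewards)

# Exact surrogate objective: E_{tau ~ pi_ref}[(Pr_theta/Pr_ref) R(tau)].
# Summed over the entire trajectory space, the Pr_ref factors cancel,
# so this matches the on-policy value exactly.
surrogate_exact = np.sum(pr_ref * (pr_theta / pr_ref) * rewards)
assert np.isclose(on_policy, surrogate_exact)

# Monte Carlo estimate from a finite sample of trajectories drawn from
# the reference policy: unbiased, but its accuracy depends on how well
# pi_ref covers the trajectories that matter under pi_theta.
n = 100
idx = rng.choice(len(rewards), size=n, p=pr_ref)
weights = pr_theta[idx] / pr_ref[idx]
surrogate_mc = np.mean(weights * rewards[idx])

print(f"on-policy objective       : {on_policy:.4f}")
print(f"surrogate (full sum)      : {surrogate_exact:.4f}")
print(f"surrogate ({n} samples)   : {surrogate_mc:.4f}")
```

The exact surrogate matches the on-policy objective to machine precision, while the sampled estimate fluctuates around it; with a reference policy that poorly covers high-reward trajectories, the importance weights become large and the estimate degrades, which is the practical limitation noted above.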
