Formula

Surrogate Objective at the Policy Reference Point

When the current policy parameters are identical to the reference policy parameters, a condition denoted by θ=θref\theta = \theta_{\text{ref}}, the standard importance-sampled surrogate objective simplifies. The importance sampling ratio becomes one, causing the surrogate objective's value to equal the expected reward of the reference policy: Eτπθref[Prθ(τ)Prθref(au)R(τ)]θ=θref=Eτπθref[R(τ)]\left. \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}( au)} R(\tau) \right] \right|_{\theta=\theta_{\text{ref}}} = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} [R(\tau)] In this specific context, the term 'surrogate objective' may refer to this simplified expression, which is equivalent to the true on-policy objective at this point.

Image 0

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Learn After