Learn Before
Surrogate Objective at the Policy Reference Point
When the current policy parameters are identical to the reference policy parameters, a condition denoted by θ = θ_ref, the standard importance-sampled surrogate objective simplifies. The importance sampling ratio Pr_θ(τ) / Pr_{θ_ref}(τ) becomes one, so the surrogate objective's value equals the expected reward of the reference policy:

J(θ_ref) = E_{τ ~ π_{θ_ref}} [ R(τ) ]

In this specific context, the term 'surrogate objective' may refer to this simplified expression, which is equivalent to the true on-policy objective at this point.
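The simplification can be checked numerically. Below is a minimal sketch, assuming a toy discrete trajectory space with made-up probabilities and rewards (none of these numbers come from the source): when the current policy's trajectory probabilities match the reference policy's, every importance ratio is 1 and the surrogate objective coincides with E_{τ ~ π_{θ_ref}}[R(τ)].

```python
import numpy as np

# Toy setup (hypothetical values): four possible trajectories tau.
pr_ref = np.array([0.1, 0.2, 0.3, 0.4])   # Pr_{theta_ref}(tau), sums to 1
rewards = np.array([1.0, 2.0, 0.5, 3.0])  # R(tau) for each trajectory

# At the reference point theta = theta_ref, the current policy assigns
# the same trajectory probabilities as the reference policy.
pr_theta = pr_ref.copy()

ratio = pr_theta / pr_ref  # importance sampling ratio: all ones here

# Surrogate objective: expectation under pi_{theta_ref} of ratio * R(tau).
surrogate = np.sum(pr_ref * ratio * rewards)

# True on-policy objective of the reference policy: E[R(tau)].
on_policy = np.sum(pr_ref * rewards)

assert np.allclose(ratio, 1.0)
assert np.isclose(surrogate, on_policy)
print(surrogate, on_policy)  # both print 1.85
```

Away from the reference point (pr_theta ≠ pr_ref) the ratios deviate from 1 and the two quantities generally differ, which is exactly why the equivalence only holds at θ = θ_ref.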

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by π_θ, and a batch of trajectory data has been collected using a previous, fixed policy, π_{θ_ref}. To improve the current policy using this existing data, the following objective function is optimized: J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ]. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
Learn After
In a reinforcement learning process, a new policy defined by parameters θ is evaluated using an objective function that relies on data from a reference policy with parameters θ_ref. The objective function is:
J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ]
Where τ is a trajectory, Pr(τ) is the probability of that trajectory, R(τ) is its total reward, and E_{τ ~ π_{θ_ref}} denotes the expected value over trajectories from the reference policy.
What does this objective function J(θ) simplify to at the specific point where the new policy is identical to the reference policy (i.e., θ = θ_ref)?
Reasoning for Objective Simplification
In a reinforcement learning scenario, the performance of a new policy, defined by parameters θ, is often estimated using an objective function that relies on data collected from a reference policy, defined by parameters θ_ref. This objective function is given by: J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ], where τ represents a trajectory, Pr(τ) is the probability of that trajectory, and R(τ) is its total reward. Which of the following statements most accurately evaluates the relationship between this objective function, J(θ), and the true expected reward of the reference policy, E_{τ ~ π_{θ_ref}}[R(τ)]?