Short Answer

Rationale for Using a Surrogate Objective

In the context of policy optimization, an agent's performance is ultimately measured by the on-policy objective, $\mathbb{E}_{\tau \sim \pi_{\theta}} [R(\tau)]$. However, many algorithms instead optimize a surrogate objective, such as $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$, where data is sampled from a reference policy $\pi_{\theta_{\text{ref}}}$. Analyze the primary practical advantage of optimizing this surrogate objective compared to directly optimizing the on-policy objective.
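
To make the estimator behind the surrogate concrete, here is a minimal PyTorch sketch (not part of the original question): it reweights rewards by the likelihood ratio $\text{Pr}_{\theta}(\tau)/\text{Pr}_{\theta_{\text{ref}}}(\tau)$ so that a batch sampled once from $\pi_{\theta_{\text{ref}}}$ can be reused across several updates to $\theta$. The function name, batch size, and learning rate are illustrative assumptions.

```python
import torch

def surrogate_objective(logp_theta, logp_ref, rewards):
    # Monte Carlo estimate of E_{tau ~ pi_ref}[ Pr_theta(tau)/Pr_ref(tau) * R(tau) ],
    # computed from trajectories sampled once from the reference policy pi_ref.
    ratio = torch.exp(logp_theta - logp_ref)  # likelihood ratio, formed in log space
    return (ratio * rewards).mean()

# Hypothetical batch: log-probs and rewards for 64 trajectories drawn from pi_ref.
logp_ref = torch.randn(64)
rewards = torch.randn(64)
# Stand-in for log Pr_theta(tau); in practice this comes from the current policy network.
logp_theta = logp_ref.clone().requires_grad_(True)

# The same fixed batch supports multiple gradient steps on theta with no fresh
# rollouts -- the practical payoff of optimizing the off-policy surrogate.
for _ in range(4):
    loss = -surrogate_objective(logp_theta, logp_ref, rewards)
    loss.backward()
    with torch.no_grad():
        logp_theta -= 0.01 * logp_theta.grad
    logp_theta.grad = None
```

By contrast, the on-policy objective $\mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$ would require resampling trajectories from the updated policy after every gradient step.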

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science