Equivalence of the Surrogate Objective and the On-Policy Objective
The surrogate objective function, which evaluates a policy π_θ using trajectories sampled from a reference policy π_θ_ref, is mathematically equivalent to the true on-policy objective. The equivalence is established by expanding the expectation of the importance-sampled reward into its summation form, where the probability of each trajectory under the reference policy cancels out:

J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ] = Σ_τ Pr_θ_ref(τ) * (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) = Σ_τ Pr_θ(τ) * R(τ) = E_{τ ~ π_θ} [R(τ)] = J_on-policy(θ).

However, this strict mathematical equivalence holds only when the expectation is taken over the entire trajectory space. In practice, because policy learning methods sample only a relatively small subset of sequences, the sampling methodology itself significantly influences the resulting estimates.
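To make both the exact equivalence and the finite-sample caveat concrete, here is a minimal Python sketch. The three-trajectory space, the reward values, the two policy distributions, and the batch size are all illustrative assumptions, not taken from the text: it computes the surrogate objective exactly over the full trajectory space (where it matches the on-policy value to machine precision) and then estimates it from a small batch sampled from the reference policy (where the estimate can deviate).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trajectory" space: 3 trajectories with fixed rewards (illustrative).
rewards = np.array([1.0, 0.0, 10.0])
p_ref = np.array([0.5, 0.4, 0.1])   # reference policy, Pr_theta_ref(tau)
p_new = np.array([0.2, 0.3, 0.5])   # target policy,   Pr_theta(tau)

# Exact on-policy objective: J(theta) = sum_tau Pr_theta(tau) * R(tau).
j_on_policy = np.sum(p_new * rewards)

# Exact surrogate objective: expectation under p_ref of the
# importance-weighted reward. The Pr_theta_ref(tau) factors cancel,
# so this equals the on-policy objective exactly.
j_surrogate = np.sum(p_ref * (p_new / p_ref) * rewards)
assert np.isclose(j_on_policy, j_surrogate)

# Finite-sample estimate: draw a small batch from the reference policy
# and average the importance-weighted rewards. This estimator is
# unbiased, but it is noisy when p_new and p_ref disagree strongly.
samples = rng.choice(len(rewards), size=20, p=p_ref)
j_estimate = np.mean((p_new[samples] / p_ref[samples]) * rewards[samples])

print(f"exact on-policy value : {j_on_policy:.3f}")
print(f"exact surrogate value : {j_surrogate:.3f}")
print(f"20-sample IS estimate : {j_estimate:.3f}")
```

The two exact values always agree; only the last line varies from run to run, which is precisely the gap between the mathematical equivalence and its small-sample behavior described above.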

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Surrogate Objective in Reinforcement Learning
Equivalence of the Surrogate Objective and the On-Policy Objective
An agent's performance is being evaluated using a set of recorded experiences (trajectories) that were generated by an older reference policy. The new target policy being evaluated assigns a specific high-reward trajectory significantly lower probability than the reference policy did. How will the contribution of this high-reward trajectory be adjusted when estimating the performance of the new target policy?
Off-Policy Performance Estimation
Consider an off-policy evaluation scenario where the performance of a 'target' policy is estimated using data collected from a 'reference' policy. If the target policy is identical to the reference policy, the importance sampling weight used to adjust the reward of every possible trajectory will be exactly 1.
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by π_θ, and a batch of trajectory data has been collected using a previous, fixed policy, π_θ_ref. To improve the current policy using this existing data, the following objective function is optimized: J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ]. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
Equivalence of the Surrogate Objective and the On-Policy Objective
A reinforcement learning agent has developed a new policy, denoted as π_new, for navigating a maze. The goal is to accurately estimate the performance of this specific policy using its on-policy objective function, which is defined as the expected cumulative reward over trajectories generated by the policy itself. Which of the following procedures correctly describes how to gather data and compute this estimate?
Evaluating a New Robotic Arm Policy
A research team is training an agent and has a policy represented by parameters θ_current. To evaluate the performance of this policy using its on-policy objective function, J(θ_current), the team can use a large, pre-existing dataset of trajectories that were collected while the agent was operating under a slightly older set of parameters, θ_previous.
Learn After
In reinforcement learning, the standard on-policy objective is defined as the expected reward under the current policy π_θ: J_on-policy(θ) = E_{τ ~ π_θ} [R(τ)]. An alternative 'surrogate' objective uses importance sampling to evaluate the policy using data from a reference policy π_θ_ref: J_surrogate(θ) = E_{τ ~ π_θ_ref} [ (Pr_θ(τ) / Pr_θ_ref(τ)) * R(τ) ]. What is the key mathematical step that demonstrates that J_surrogate(θ) is exactly equivalent to J_on-policy(θ)?
The following steps demonstrate that the surrogate objective, which uses importance sampling, is equivalent to the standard on-policy objective in reinforcement learning. Arrange these mathematical steps in the correct logical order to form the complete derivation (a worked rendering of the full chain is sketched at the end of this page).
Critique of Surrogate Objective Approximation
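For reference, here is a minimal LaTeX sketch of the complete derivation that the ordering exercise above walks through, written with the same notation as the Learn After question; the step annotations are mine, not quoted from the text:

```latex
\begin{align*}
J_{\text{surrogate}}(\theta)
  &= \mathbb{E}_{\tau \sim \pi_{\theta_{\mathrm{ref}}}}
     \left[\frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau)\right]
     && \text{(surrogate objective)} \\
  &= \sum_{\tau} \Pr_{\theta_{\mathrm{ref}}}(\tau)\,
     \frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\mathrm{ref}}}(\tau)}\, R(\tau)
     && \text{(expand the expectation into a sum)} \\
  &= \sum_{\tau} \Pr_{\theta}(\tau)\, R(\tau)
     && \text{(reference probabilities cancel)} \\
  &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau)\right]
     = J_{\text{on-policy}}(\theta)
     && \text{(rewrite the sum as an expectation)}
\end{align*}
```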