Multiple Choice

In policy optimization, an objective function is often constructed using data from a fixed, older policy (the reference policy $\pi_{\text{ref}}$) to estimate the performance of a new policy $\pi_\theta$ being optimized. The objective weights each trajectory's reward by an importance sampling ratio:

$$ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}} \left[ \frac{P_\theta(\tau)}{P_{\text{ref}}(\tau)} \, R(\tau) \right], $$

where $P_\theta(\tau)$ and $P_{\text{ref}}(\tau)$ are the probabilities of trajectory $\tau$ under the new and reference policies, and $R(\tau)$ is the trajectory's reward. A critical property of this objective is that its gradient, evaluated at the point where the new policy is identical to the reference policy ($\pi_\theta = \pi_{\text{ref}}$), is exactly the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?
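A brief sketch of the derivation behind this property, in the notation above (the standard score-function argument, with $\nabla_\theta$ the gradient with respect to the new policy's parameters): since the sampling distribution $P_{\text{ref}}$ does not depend on $\theta$, the gradient passes inside the expectation,

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}} \left[ \frac{\nabla_\theta P_\theta(\tau)}{P_{\text{ref}}(\tau)} \, R(\tau) \right]. $$

Evaluating at the point $P_\theta = P_{\text{ref}}$ and applying the log-derivative identity $\nabla_\theta P_\theta(\tau) = P_\theta(\tau) \, \nabla_\theta \log P_\theta(\tau)$ gives

$$ \nabla_\theta J(\theta) \Big|_{\pi_\theta = \pi_{\text{ref}}} = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P_\theta(\tau) \, R(\tau) \right], $$

which is exactly the on-policy (REINFORCE) policy gradient: at this point the importance ratio equals 1, and its gradient reduces to the score function.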


Tags

Ch.4 Alignment - Foundations of Large Language Models
