In policy optimization, an importance-sampled surrogate objective is often used to approximate the true on-policy objective. A key mathematical property of this surrogate is that its gradient, when evaluated at the reference policy (i.e., the policy used to collect the data), is identical to the true on-policy policy gradient. What is the most significant implication of this property for the training process?
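A compact sketch of the stated property, in notation assumed here for illustration: write $p_\theta(\tau)$ for the probability of trajectory $\tau$ under the policy being optimized, $p_{\theta_{\text{ref}}}(\tau)$ for its probability under the reference policy, $R(\tau)$ for its reward, and $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$ for the true on-policy objective. Then

$$
\nabla_\theta\, \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]\Bigg|_{\theta = \theta_{\text{ref}}}
= \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\nabla_\theta \log p_\theta(\tau)\Big|_{\theta_{\text{ref}}}\, R(\tau)\right]
= \nabla_\theta J(\theta)\Big|_{\theta = \theta_{\text{ref}}},
$$

so at the reference policy the surrogate is a first-order-accurate local approximation of the true objective.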
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. After a single gradient update to the policy parameters, however, the two gradients are no longer identical. Does this divergence indicate a flaw in the algorithm's implementation?
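A minimal numerical sketch of this scenario on a hypothetical two-action bandit (the setup, names, and learning rate below are illustrative assumptions, not from the card):

```python
import numpy as np

# Toy bandit: pi_theta(a=1) = sigmoid(theta), rewards r(0) = 0.0, r(1) = 1.0.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob(theta, a):
    """Probability of action a under pi_theta."""
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

def grad_log_prob(theta, a):
    """d/dtheta log pi_theta(a) = a - sigmoid(theta) for this parameterization."""
    return a - sigmoid(theta)

rng = np.random.default_rng(0)
theta_ref = 0.3
rewards = np.array([0.0, 1.0])

# One fixed batch, sampled from the reference policy and then reused.
actions = (rng.random(10_000) < sigmoid(theta_ref)).astype(int)

def surrogate_grad(theta):
    # Gradient of (1/N) sum_i ratio_i * R_i:
    # (1/N) sum_i ratio_i * grad log pi_theta(a_i) * R_i.
    ratio = np.array([prob(theta, a) / prob(theta_ref, a) for a in actions])
    glp = np.array([grad_log_prob(theta, a) for a in actions])
    return np.mean(ratio * glp * rewards[actions])

def reinforce_grad(theta):
    # Plain score-function (REINFORCE) estimator on the same batch.
    glp = np.array([grad_log_prob(theta, a) for a in actions])
    return np.mean(glp * rewards[actions])

print(surrogate_grad(theta_ref), reinforce_grad(theta_ref))  # identical: every ratio is 1
theta_new = theta_ref + 0.5 * surrogate_grad(theta_ref)      # one gradient step
print(surrogate_grad(theta_new), reinforce_grad(theta_new))  # no longer identical
```

On the fixed batch the two estimators coincide exactly at $\theta_{\text{ref}}$ because every importance ratio equals 1; once $\theta$ moves, the ratios depart from 1 and the two expressions differ. Mathematically, this divergence is the expected behavior of the surrogate away from the reference point, not by itself evidence of an implementation bug.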
In policy optimization, an objective function is often constructed using data from a fixed, older policy (the 'reference policy') to estimate the performance of a new policy being optimized. This objective uses an importance sampling ratio:

$$
J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right],
$$

where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under the new policy, $p_{\theta_{\text{ref}}}(\tau)$ is its probability under the reference policy, and $R(\tau)$ is its reward. A critical property of this objective is that its gradient, when evaluated at the point where the new policy is identical to the reference policy, is exactly equal to the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?
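A sketch of the usual justification, in the notation above: by the log-derivative identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, differentiating the ratio and evaluating where the two policies coincide gives

$$
\nabla_\theta \frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\Bigg|_{\theta = \theta_{\text{ref}}}
= \frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, \nabla_\theta \log p_\theta(\tau)\Bigg|_{\theta = \theta_{\text{ref}}}
= \nabla_\theta \log p_\theta(\tau)\Big|_{\theta = \theta_{\text{ref}}},
$$

since the importance ratio equals 1 at $\theta = \theta_{\text{ref}}$. The surrogate's gradient therefore reduces to $\mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$, which is exactly the score-function (REINFORCE) form of the on-policy policy gradient.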