In policy optimization, an importance-sampled surrogate objective is often used to approximate the true on-policy objective. A key mathematical property of this surrogate is that its gradient, when evaluated at the reference policy (i.e., the policy used to collect the data), is identical to the true on-policy policy gradient. What is the most significant implication of this property for the training process?
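A compact sketch of the stated property, in notation assumed here for illustration: write $p_\theta(\tau)$ for the probability of trajectory $\tau$ under the policy being optimized, $p_{\theta_{\text{ref}}}(\tau)$ for its probability under the reference policy, $R(\tau)$ for its reward, and $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$ for the true on-policy objective. Then

$$
\nabla_\theta\, \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right]\Bigg|_{\theta = \theta_{\text{ref}}}
= \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\nabla_\theta \log p_\theta(\tau)\Big|_{\theta_{\text{ref}}}\, R(\tau)\right]
= \nabla_\theta J(\theta)\Big|_{\theta = \theta_{\text{ref}}},
$$

so at the reference policy the surrogate is a first-order-accurate local approximation of the true objective.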
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. After a single gradient update to the policy parameters, however, the two gradients are no longer identical. Does this divergence indicate a flaw in the algorithm's implementation?
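A minimal numerical sketch of this scenario on a hypothetical two-action bandit (the setup, names, and learning rate below are illustrative assumptions, not from the card):

```python
import numpy as np

# Toy bandit: pi_theta(a=1) = sigmoid(theta), rewards r(0) = 0.0, r(1) = 1.0.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob(theta, a):
    """Probability of action a under pi_theta."""
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

def grad_log_prob(theta, a):
    """d/dtheta log pi_theta(a) = a - sigmoid(theta) for this parameterization."""
    return a - sigmoid(theta)

rng = np.random.default_rng(0)
theta_ref = 0.3
rewards = np.array([0.0, 1.0])

# One fixed batch, sampled from the reference policy and then reused.
actions = (rng.random(10_000) < sigmoid(theta_ref)).astype(int)

def surrogate_grad(theta):
    # Gradient of (1/N) sum_i ratio_i * R_i:
    # (1/N) sum_i ratio_i * grad log pi_theta(a_i) * R_i.
    ratio = np.array([prob(theta, a) / prob(theta_ref, a) for a in actions])
    glp = np.array([grad_log_prob(theta, a) for a in actions])
    return np.mean(ratio * glp * rewards[actions])

def reinforce_grad(theta):
    # Plain score-function (REINFORCE) estimator on the same batch.
    glp = np.array([grad_log_prob(theta, a) for a in actions])
    return np.mean(glp * rewards[actions])

print(surrogate_grad(theta_ref), reinforce_grad(theta_ref))  # identical: every ratio is 1
theta_new = theta_ref + 0.5 * surrogate_grad(theta_ref)      # one gradient step
print(surrogate_grad(theta_new), reinforce_grad(theta_new))  # no longer identical
```

On the fixed batch the two estimators coincide exactly at $\theta_{\text{ref}}$ because every importance ratio equals 1; once $\theta$ moves, the ratios depart from 1 and the two expressions differ. Mathematically, this divergence is the expected behavior of the surrogate away from the reference point, not by itself evidence of an implementation bug.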
In policy optimization, an objective function is often constructed using data from a fixed, older policy (the 'reference policy') to estimate the performance of a new policy being optimized. This objective uses an importance sampling ratio:

$$
J_{\text{surr}}(\theta) = \mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\!\left[\frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, R(\tau)\right],
$$

where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under the new policy, $p_{\theta_{\text{ref}}}(\tau)$ is its probability under the reference policy, and $R(\tau)$ is its reward. A critical property of this objective is that its gradient, when evaluated at the point where the new policy is identical to the reference policy, is exactly equal to the standard on-policy policy gradient. Which of the following statements provides the core mathematical justification for why this equivalence holds?
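A sketch of the usual justification, in the notation above: by the log-derivative identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, differentiating the ratio and evaluating where the two policies coincide gives

$$
\nabla_\theta \frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\Bigg|_{\theta = \theta_{\text{ref}}}
= \frac{p_\theta(\tau)}{p_{\theta_{\text{ref}}}(\tau)}\, \nabla_\theta \log p_\theta(\tau)\Bigg|_{\theta = \theta_{\text{ref}}}
= \nabla_\theta \log p_\theta(\tau)\Big|_{\theta = \theta_{\text{ref}}},
$$

since the importance ratio equals 1 at $\theta = \theta_{\text{ref}}$. The surrogate's gradient therefore reduces to $\mathbb{E}_{\tau \sim p_{\theta_{\text{ref}}}}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$, which is exactly the score-function (REINFORCE) form of the on-policy policy gradient.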