1Cademy - In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. However, after applying a single gradient update to the policy parameters, the two gradients are no longer identical. This divergence indicates a flaw in the algorithm's implementation.

Learn Before

Equivalence of Surrogate and On-Policy Gradients at the Reference Point

True/False

In a policy optimization algorithm that uses an importance-sampled surrogate objective, a developer observes that the gradient of the surrogate objective is identical to the on-policy policy gradient at the start of an update step. However, after applying a single gradient update to the policy parameters, the two gradients are no longer identical. This divergence indicates a flaw in the algorithm's implementation.
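
One way to probe this statement is to compute both gradients exactly on a tiny problem. The sketch below is only an illustration under stated assumptions: the 2-state, 2-action MDP (`P`, `R`, `MU`), the tabular softmax policy, the step size, and all function names are made up for this example and do not come from any particular library or algorithm. Both gradients omit the common 1/(1 - gamma) factor. The surrogate keeps the state distribution and advantages of the reference policy fixed and lets only the importance ratio depend on the current parameters.

```python
import numpy as np

GAMMA = 0.9
# Hypothetical toy MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
MU = np.array([0.5, 0.5])  # start-state distribution


def softmax_policy(theta):
    """Tabular softmax policy pi[s, a] from logits theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate(pi):
    """Exact discounted state visitation d[s] and advantage A[s, a] under pi."""
    n = pi.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)       # state-to-state transitions
    r_pi = (pi * R).sum(axis=1)                 # expected one-step reward
    V = np.linalg.solve(np.eye(n) - GAMMA * P_pi, r_pi)
    Q = R + GAMMA * (P @ V)
    A = Q - V[:, None]
    d = (1 - GAMMA) * np.linalg.solve(np.eye(n) - GAMMA * P_pi.T, MU)
    return d, A


def softmax_grad(d, pi, A):
    """Gradient w.r.t. the logits of sum_s d[s] sum_a pi[s, a] A[s, a],
    treating d and A as constants."""
    baseline = (pi * A).sum(axis=1, keepdims=True)
    return d[:, None] * pi * (A - baseline)


def on_policy_grad(theta):
    """Policy gradient with d and A taken from the current policy."""
    pi = softmax_policy(theta)
    d, A = evaluate(pi)
    return softmax_grad(d, pi, A)


def surrogate_grad(theta, theta_old):
    """Gradient of the importance-sampled surrogate
    sum_s d_old[s] sum_a pi_old[s, a] * (pi[s, a] / pi_old[s, a]) * A_old[s, a],
    which simplifies to sum_s d_old[s] sum_a pi[s, a] * A_old[s, a]."""
    d_old, A_old = evaluate(softmax_policy(theta_old))
    return softmax_grad(d_old, softmax_policy(theta), A_old)


theta_old = np.zeros((2, 2))                    # reference point
g_sur0 = surrogate_grad(theta_old, theta_old)
g_pg0 = on_policy_grad(theta_old)

theta_new = theta_old + 1.0 * g_sur0            # one gradient ascent step
g_sur1 = surrogate_grad(theta_new, theta_old)
g_pg1 = on_policy_grad(theta_new)

print(np.allclose(g_sur0, g_pg0))  # True: the gradients match at the reference point
print(np.allclose(g_sur1, g_pg1))  # False: they differ after the step
```

In this toy setup the two gradients separate after one step even though the code follows the definitions directly: once the parameters move away from the reference point, the surrogate still uses the old state distribution and old advantages, while the on-policy gradient uses those of the updated policy. That suggests the divergence by itself is expected behavior rather than evidence of a bug.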

Updated 2025-10-04

