Multiple Choice

In a policy gradient algorithm, a common way to stabilize learning is to subtract a baseline from the total reward of each trajectory before computing the update. A well-chosen baseline reduces the variance of the gradient estimate without changing its expected direction. Which of the following baselines, if subtracted from the total reward, would bias the estimate and could, on average, push the policy updates in the wrong direction?
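
For context on why the baseline matters: subtracting any action-independent value b leaves the expected update unchanged, because E[b ∇θ log πθ(a)] = b Σ_a πθ(a) ∇θ log πθ(a) = b ∇θ Σ_a πθ(a) = b ∇θ 1 = 0. A baseline that varies with the sampled action breaks this identity. The sketch below is a minimal illustration of that point, assuming a toy 3-armed bandit with a softmax policy; the setup, variable names, and the particular action-dependent baseline are illustrative choices, not part of the original question.

```python
# Minimal sketch (assumed setup: a 3-armed bandit with a softmax policy;
# all names here are illustrative, not taken from the question).
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.4])    # policy logits
rewards = np.array([1.0, 2.0, 3.0])   # deterministic reward for each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
eye = np.eye(3)

# Exact gradient of E[R]: for a softmax policy, grad log pi(a) = e_a - pi,
# so the true gradient is sum_a pi(a) * R(a) * (e_a - pi).
true_grad = sum(pi[a] * rewards[a] * (eye[a] - pi) for a in range(3))

def mc_grad(baseline, n=200_000):
    """Monte Carlo REINFORCE gradient using (R - b) * grad log pi(a)."""
    a = rng.choice(3, p=pi, size=n)        # sample actions from the policy
    advantage = rewards[a] - baseline(a)   # subtract the chosen baseline
    return (advantage[:, None] * (eye[a] - pi)).mean(axis=0)

# A constant (action-independent) baseline leaves the estimate unbiased.
constant = mc_grad(lambda a: pi @ rewards)
# A baseline that depends on the sampled action biases the estimate:
# this choice shrinks the expected gradient to half its true value.
action_dep = mc_grad(lambda a: 0.5 * rewards[a])

print("true gradient      :", true_grad)
print("constant baseline  :", constant)     # close to true_grad
print("action-dep baseline:", action_dep)   # about 0.5 * true_grad
```

With the constant baseline, the Monte Carlo estimate matches the exact gradient up to sampling noise; with the action-dependent baseline it converges to roughly half the true gradient, a systematic bias of the kind the question describes.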


