Learn Before
In the mathematical proof demonstrating that a state-dependent baseline b(s_t) does not introduce bias to the policy gradient estimate, the expected value of the baseline-related term, E[ (∇θ log πθ(a_t|s_t)) * b(s_t) ], evaluates to zero. Which of the following is the fundamental reason for this outcome?
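The zero-expectation property the question refers to can be verified exactly for a small discrete action space. The sketch below is illustrative only (the softmax policy, parameter values, and baseline value are invented, not part of the card): for a softmax policy, ∇θ log πθ(a) = e_a − π, and summing π(a)·score(a)·b over all actions gives the zero vector because the baseline factors out of the sum of gradients.

```python
import numpy as np

# Softmax policy over 3 actions in one state: pi(a) = exp(theta_a) / sum_b exp(theta_b)
theta = np.array([0.5, -1.2, 2.0])   # arbitrary policy parameters (assumption)
pi = np.exp(theta) / np.exp(theta).sum()

# For a softmax, grad_theta log pi(a) = e_a - pi (one-hot minus probability vector).
scores = np.eye(len(theta)) - pi     # row a holds the score vector for action a

b = 3.7                              # arbitrary state-dependent baseline value b(s) (assumption)

# Exact expectation over actions: sum_a pi(a) * grad_theta log pi(a) * b
expectation = (pi[:, None] * scores * b).sum(axis=0)
print(expectation)                   # zero vector (up to floating-point error)
```

Because b is constant with respect to the action, it factors out: the expectation reduces to b · Σ_a ∇π(a) = b · ∇(Σ_a π(a)) = b · ∇1 = 0, which is the identity the proof in the question relies on.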
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of the Baseline's Effect on Policy Gradient Expectation
In a policy gradient algorithm, a common technique to stabilize learning is to subtract a calculated value from the total reward of each trajectory before computing the update. This is done to reduce the variability of the updates without altering their expected direction. Which of the following calculated values, if subtracted from the total reward, would introduce an incorrect bias and potentially lead the policy updates in the wrong direction on average?
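The failure mode this related question points at can be seen with the same kind of exact computation. In the invented sketch below (not the card's answer choices), the subtracted value depends on the action actually taken; the correction term then no longer has zero expectation, so the gradient estimate is biased:

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])    # arbitrary softmax policy parameters (assumption)
pi = np.exp(theta) / np.exp(theta).sum()
scores = np.eye(len(theta)) - pi      # grad_theta log pi(a) = e_a - pi for a softmax

# A "baseline" that (incorrectly) depends on the sampled action, not just the state.
b_of_a = np.array([1.0, 2.0, 3.0])    # invented action-dependent values (assumption)

# Exact expectation over actions: sum_a pi(a) * grad_theta log pi(a) * b(a)
expectation = (pi[:, None] * scores * b_of_a[:, None]).sum(axis=0)
print(expectation)                    # generally nonzero -> the subtraction biases the update
```

Because b(a) cannot be factored out of the sum over actions, the argument that makes a state-dependent baseline unbiased no longer applies.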