Learn Before
In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep t is an estimate of the value of the specific action a_t taken in state s_t. What is the primary theoretical problem with this choice of baseline?
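The theoretical problem the question points at can be made precise with a short derivation (standard score-function argument, not part of this card): a baseline is harmless only if it is independent of the sampled action, because then its contribution to the expected gradient vanishes.

```latex
% For a baseline b(s_t) that does NOT depend on the action a_t:
\mathbb{E}_{a_t \sim \pi_\theta(\cdot\mid s_t)}
  \big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  = b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t)
  = b(s_t)\, \nabla_\theta \underbrace{\sum_{a} \pi_\theta(a \mid s_t)}_{=\,1}
  = 0.
% If instead b depends on a_t (e.g. b = Q(s_t, a_t)), it cannot be pulled
% outside the expectation, the term is generally nonzero, and subtracting it
% biases the gradient estimate.
```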
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Decomposition of Reward Sum for Causality in Policy Gradients
In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?
Rationale for Using a Baseline in Policy Gradients
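The effect of the baseline choice can be checked numerically. Below is a minimal sketch (my own illustration, not from this card) using a hypothetical 3-armed bandit with a softmax policy: the exact expected gradient is computed by enumeration, once with a state-value baseline V(s) (unbiased) and once with the action-dependent baseline b = Q(s, a_t) = R(a_t), which cancels the reward term and destroys the learning signal.

```python
import numpy as np

# Hypothetical 3-armed bandit: softmax policy over logits theta,
# deterministic reward R(a) per action.
theta = np.array([0.2, -0.1, 0.5])
rewards = np.array([1.0, 2.0, 3.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

def grad_log_pi(a):
    # For a softmax policy: grad_theta log pi(a) = e_a - pi
    g = -pi.copy()
    g[a] += 1.0
    return g

# True policy gradient: E_a[grad log pi(a) * R(a)], computed exactly.
true_grad = sum(pi[a] * grad_log_pi(a) * rewards[a] for a in range(3))

# State-dependent baseline b = V(s) = E[R]: expectation is unchanged,
# because E_a[grad log pi(a)] = 0.
b_const = rewards @ pi
grad_const = sum(pi[a] * grad_log_pi(a) * (rewards[a] - b_const)
                 for a in range(3))

# Action-dependent baseline b(a_t) = Q(s, a_t) = R(a_t): every term
# (R(a) - b(a)) is zero, so the expected update collapses to zero.
grad_action = sum(pi[a] * grad_log_pi(a) * (rewards[a] - rewards[a])
                  for a in range(3))

print(np.allclose(true_grad, grad_const))  # state baseline stays unbiased
print(np.allclose(grad_action, 0.0))       # Q-baseline erases the gradient
```

Both checks print `True`: subtracting V(s) leaves the expected gradient exactly equal to the true gradient, while subtracting the value of the specific action taken zeroes out the signal the policy needs to improve.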