Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). The rewards received before time t (∑_{k=1}^{t-1} r_k) do not depend on a_t, so their expected contribution to the gradient is zero: the score function ∇_θ log π_θ(a_t|s_t) has zero mean under the policy, and multiplying it by a constant keeps that mean at zero. Including past rewards therefore does not bias the gradient estimate, but it does inflate its variance; using only the reward-to-go ∑_{k=t}^{T} r_k yields the same expected gradient with lower variance.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

Policy Gradient Optimization
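The effect of the past-reward term can be checked numerically. The sketch below is purely illustrative (a two-action softmax bandit with an artificial "past reward" constant c; none of these names or numbers come from the note): it estimates the policy gradient once with the full return, c + r, and once with the reward-to-go, r alone, and compares the two estimators.

```python
# Monte Carlo sketch: a constant "past reward" c multiplying the score
# ∇_θ log π_θ(a) leaves the expected gradient unchanged but inflates variance.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                        # softmax policy over two actions
pi = np.exp(theta) / np.exp(theta).sum()
r = np.array([1.0, 0.0])                   # reward for each action
c = 5.0                                    # "past reward" sum, independent of a_t

n = 200_000
a = rng.choice(2, size=n, p=pi)            # sample actions from the policy
score = np.eye(2)[a] - pi                  # ∇_θ log π_θ(a) for a softmax policy

g_full = score * (c + r[a])[:, None]       # score times full return
g_togo = score * r[a][:, None]             # score times reward-to-go only

print(g_full.mean(axis=0), g_togo.mean(axis=0))  # means nearly identical
print(g_full.var(axis=0), g_togo.var(axis=0))    # full-return variance is larger
```

Both estimators converge to the same expected gradient; the full-return version simply pays for the constant c with extra variance, which is why the reward-to-go form is preferred in practice.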