Multiple Choice

In a reinforcement learning setting, the policy π_θ(a_t|s_t) is updated via a gradient computation. For a specific timestep t in a trajectory of length T, the term contributing to the update can be written as follows, separating rewards received before time t from those received at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

0

1
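To explore this question numerically, here is a minimal sketch (the softmax policy, parameter values, and sample size are illustrative assumptions, not part of the original question). Any reward earned before time t is a quantity independent of a_t, so its contribution to the gradient is that constant times the Monte Carlo average of ∇_θ log π_θ(a_t|s_t) under the policy, which the sketch estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])  # illustrative logits for a 3-action policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    # For a softmax policy: ∇_θ log π_θ(a) = one_hot(a) - softmax(θ)
    g = -softmax(theta)
    g[a] += 1.0
    return g

# Sample actions from π_θ and average the score ∇_θ log π_θ(a_t|s_t).
p = softmax(theta)
actions = rng.choice(len(theta), size=200_000, p=p)
avg_score = np.mean([grad_log_pi(a, theta) for a in actions], axis=0)
print(avg_score)
```

Multiplying `avg_score` by the (constant) sum of past rewards gives the past-reward term's contribution to the expected gradient at time t; inspecting the printed estimate answers the question above.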

Updated 2025-10-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science