In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right)
Here, t is a specific timestep within the trajectory of length T, \pi_\theta(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t ( \nabla_\theta \log \pi_\theta(a_t|s_t) ) and the two components of the reward sum?
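For concreteness, a minimal NumPy sketch of this per-trajectory term, assuming toy stand-in arrays for the score-function gradients and rewards (names such as grad_log_pi are illustrative, not from any particular library):

```python
import numpy as np

# Toy stand-ins for one trajectory of length T (all names are illustrative):
# grad_log_pi[t] plays the role of \nabla_\theta \log \pi_\theta(a_t|s_t),
# flattened into a single parameter vector of dimension d.
T, d = 5, 3
rng = np.random.default_rng(0)
grad_log_pi = rng.normal(size=(T, d))  # score-function gradients per timestep
r = rng.normal(size=T)                 # rewards r_1, ..., r_T (0-indexed here)
b = 0.5                                # the constant baseline

# Per-timestep weight: total trajectory reward minus the baseline.
total = r.sum()
grad_full = sum(grad_log_pi[t] * (total - b) for t in range(T))

# The same weight with the reward sum split at timestep t into
# past rewards (k < t) and reward-to-go (k >= t).
grad_split = sum(
    grad_log_pi[t] * (r[:t].sum() + r[t:].sum() - b) for t in range(T)
)

# For any fixed trajectory the two forms coincide exactly.
assert np.allclose(grad_full, grad_split)
```

For a fixed trajectory the split form is algebraically identical to the total-return form; the question above concerns how the two components behave in expectation relative to the policy term.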
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In the context of improving a policy gradient estimator, the total reward for a trajectory, ( \sum_{k=1}^{T} r_k ), is often rewritten inside the gradient calculation for a specific timestep t as ( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k ). This specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
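A minimal sketch of the standard argument, assuming a toy softmax policy over three actions (theta and all other names are illustrative): the expected score function at any state is the zero vector, so a weight that does not depend on the sampled action a_t, such as the past-reward sum or the constant baseline b, contributes nothing to the expected gradient.

```python
import numpy as np

# Toy softmax policy over three actions; theta is an illustrative parameter.
theta = np.array([0.2, -1.0, 0.7])

def pi(theta):
    z = np.exp(theta - theta.max())  # numerically stable softmax
    return z / z.sum()

p = pi(theta)

# For a softmax policy the score function has the closed form
# \nabla_\theta \log \pi(a) = onehot(a) - pi(theta); row a below is that gradient.
score = np.eye(3) - p

# Expected score under the policy is the zero vector:
# \sum_a \pi(a) \nabla_\theta \log \pi(a) = \nabla_\theta \sum_a \pi(a) = \nabla_\theta 1 = 0.
expected_score = p @ score
assert np.allclose(expected_score, 0.0)

# Any weight c that is independent of the sampled action (past rewards,
# or the constant baseline b) therefore contributes c * 0 = 0 in expectation.
```

Dropping the past-reward component therefore leaves the estimator's expectation unchanged while typically reducing its variance, which is the usual motivation for the reward-to-go form with a baseline.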