Multiple Choice

In policy gradient methods, the gradient of the log-probability of a trajectory is initially expressed as the sum of two components: one related to the agent's actions and another related to the environment's transitions. The expression is then simplified by removing the environment's component before optimization. Given the initial expression

$$\frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) + \sum_{t=1}^{T} \log \text{Pr}(s_{t+1}|s_t, a_t) \right],$$

what is the fundamental assumption that justifies simplifying it to just the policy component, $\frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t)$?
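For context, the two sums in the initial expression come from the standard factorization of the trajectory probability in a Markov decision process; a sketch, assuming an initial-state distribution $\text{Pr}(s_1)$:

$$\text{Pr}_{\theta}(\tau) = \text{Pr}(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)\, \text{Pr}(s_{t+1}|s_t, a_t),$$

so that taking the log turns the product into sums:

$$\log \text{Pr}_{\theta}(\tau) = \log \text{Pr}(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) + \sum_{t=1}^{T} \log \text{Pr}(s_{t+1}|s_t, a_t),$$

where the $\log \text{Pr}(s_1)$ term, which does not involve $\theta$, is already omitted from the expression in the question.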


Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Analysis in Bloom's Taxonomy