Multiple Choice

Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.

Derivation: Let the policy gradient objective be:

J(\theta) \propto \sum_{\tau} \left[ \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \right) \left( \sum_{k=1}^{T} r_k - b \right) \right]
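
As a sanity check on what the bracketed expression computes for a single trajectory, here is a minimal sketch. It is a sketch only: the function name pg_estimate, the precomputed grad_log_probs input, and the scalar baseline argument are illustrative assumptions, not anything fixed by the derivation.

```python
import numpy as np

def pg_estimate(grad_log_probs, rewards, baseline=0.0):
    """One-trajectory REINFORCE-with-baseline term (illustrative sketch).

    grad_log_probs: list of arrays; the t-th entry is
        d/dtheta log pi_theta(a_t | s_t) for this trajectory.
    rewards: list of scalars r_1, ..., r_T.
    baseline: scalar b, assumed independent of theta.
    """
    # (sum_k r_k) - b: constant with respect to theta for a fixed trajectory.
    advantage = sum(rewards) - baseline
    # sum_t d/dtheta log pi_theta(a_t | s_t): the score term.
    score = np.sum(np.stack(grad_log_probs), axis=0)
    # Product form matching the bracketed expression above.
    return advantage * score
```

Summing (or averaging) this quantity over sampled trajectories τ yields the estimator that the steps below manipulate.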

Step 1: The reward term, which is constant with respect to the parameters θ, is moved inside the derivative:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{T} r_k - b \right) \right) \right]
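
Step 1 relies only on the fact that a factor constant in θ passes through the derivative. For any differentiable f(θ) and any c independent of θ:

\left( \frac{\partial}{\partial \theta} f(\theta) \right) \cdot c = \frac{\partial}{\partial \theta} \left( f(\theta) \cdot c \right)

with f(\theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) and c = \sum_{k=1}^{T} r_k - b here.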

Step 2: The reward term, which is constant for a given trajectory τ, is distributed inside the summation over timesteps t:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \left( \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{T} r_k - b \right) \right) \right]
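
Step 2 in turn uses only distributivity of a constant factor over a finite sum:

\left( \sum_{t=1}^{T} f_t(\theta) \right) \cdot c = \sum_{t=1}^{T} \left( f_t(\theta) \cdot c \right)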

Step 3: The total reward sum \sum_{k=1}^{T} r_k is decomposed into rewards before and after the current timestep t:

\propto \sum_{\tau} \left[ \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \left( \log \pi_{\theta}(a_t|s_t) \cdot \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t+1}^{T} r_k - b \right) \right) \right]
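
For reference, the standard split of a sum over k = 1, \dots, T around an interior index t is:

\sum_{k=1}^{T} r_k = \sum_{k=1}^{t-1} r_k + r_t + \sum_{k=t+1}^{T} r_k

It is worth comparing this identity term by term against the expression in Step 3.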

Which step introduces an error into the derivation?


Updated 2025-09-28


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
