Learn Before
Decomposition of Reward Sum for Causality in Policy Gradients
To distinguish between rewards accrued before and after an action at time step \(t\), the total reward inside the gradient calculation can be decomposed. The sum is brought inside the gradient operator and split into past and future components:

\[ \sum_{k=1}^{T} r_k = \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \]

This decomposition explicitly separates the past-rewards term \( \sum_{k=1}^{t-1} r_k \), preparing the equation so this term can be safely omitted to further reduce gradient variance: by causality, the action taken at time step \(t\) cannot influence rewards that were already received before time step \(t\).
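As a quick numerical check (a minimal sketch with made-up rewards, not code from the course), the split can be verified at every timestep \(t\):

```python
# Check sum_{k=1}^{T} r_k = sum_{k=1}^{t-1} r_k + sum_{k=t}^{T} r_k at each t.
rewards = [1.0, 0.0, 2.0, 3.0, 1.0]  # hypothetical rewards r_1 .. r_T

for t in range(1, len(rewards) + 1):    # 1-indexed timesteps, as in the formula
    past = sum(rewards[: t - 1])        # sum_{k=1}^{t-1} r_k (the term dropped later)
    future = sum(rewards[t - 1:])       # sum_{k=t}^{T} r_k (the "reward-to-go")
    assert abs(past + future - sum(rewards)) < 1e-9  # the split is purely algebraic
```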

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In policy gradient methods, a baseline \(b\) is subtracted from the total reward for a trajectory, \(R(\tau)\), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to \( \big( \nabla_\theta \sum_t \log \pi_\theta(a_t|s_t) \big) (R(\tau) - b) \). Which of the following would be a valid and effective choice for the baseline \(b\)?

In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep \(t\) is an estimate of the value of the specific action \(a_t\) taken in state \(s_t\). What is the primary theoretical problem with this choice of baseline?

Rationale for Using a Baseline in Policy Gradients
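Following the two baseline questions above, here is a rough numerical illustration (a hedged sketch built on an assumed one-step toy problem, not taken from the course) of why subtracting an action-independent constant \(b\) is valid: the expected gradient weight is unchanged, while its variance can drop sharply when \(b\) is close to the expected return.

```python
import random

random.seed(0)

def sample_weight(b):
    # Toy one-step setup (assumed for illustration): the score term is +/-1
    # for the two actions, and the return R does not depend on the action,
    # so the true expected weight is exactly 0 for any constant b.
    score = random.choice([+1.0, -1.0])   # stand-in for grad log pi(a|s)
    R = 5.0 + random.gauss(0.0, 1.0)      # noisy return with mean 5
    return score * (R - b)

for b in (0.0, 5.0):                      # no baseline vs. b close to E[R]
    ws = [sample_weight(b) for _ in range(100_000)]
    mean = sum(ws) / len(ws)
    var = sum((w - mean) ** 2 for w in ws) / len(ws)
    print(f"b={b}: mean ~ {mean:+.3f}, variance ~ {var:.2f}")
# Both means are ~0 (the estimate stays unbiased), but the variance drops
# from roughly 26 to roughly 1 when b matches the expected return.
```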
Learn After
Policy Gradient with Reward-to-Go and Baseline
In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right) \]

Here, \(t\) is a specific timestep within the trajectory of length \(T\), \(\pi_\theta(a_t|s_t)\) is the probability of taking action \(a_t\) in state \(s_t\), \(r_k\) is the reward at timestep \(k\), and \(b\) is a constant value. Which statement best analyzes the relationship between the policy term for timestep \(t\), \(\nabla_\theta \log \pi_\theta(a_t|s_t)\), and the two components of the reward sum?

In the context of improving a policy gradient estimator, the total reward for a trajectory, \( \sum_{k=1}^{T} r_k \), is often rewritten inside the gradient calculation for a specific timestep \(t\) as \( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \). True or false: this specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
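To connect the two items above (a hedged sketch with hypothetical rewards; the variable names are my own, not the course's), the per-timestep weight multiplying \(\nabla_\theta \log \pi_\theta(a_t|s_t)\), once the past-rewards component is dropped, is the reward-to-go minus the constant baseline \(b\):

```python
# Hypothetical trajectory rewards and constant baseline (made-up numbers).
rewards = [1.0, 0.0, 2.0, 3.0]  # r_1 .. r_T
b = 1.5
T = len(rewards)

# weight_t = sum_{k=t}^{T} r_k - b : the factor that scales
# grad log pi(a_t|s_t) once the past-rewards component is omitted.
weights = [sum(rewards[t - 1:]) - b for t in range(1, T + 1)]
print(weights)  # [4.5, 3.5, 3.5, 1.5]

# In an autodiff framework one would typically build the surrogate loss
#   loss = -sum(logp[t] * weights[t] for t in range(T))
# so that its gradient matches the per-trajectory expression above.
```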