Learn Before
Derivation of Reward Decomposition in Policy Gradient with Baseline
The policy gradient with a baseline can be mathematically manipulated to separate past and future rewards, which is a key step toward applying the causality principle for variance reduction. The derivation begins with the standard policy gradient formula with a baseline. The total reward term is then distributed into the sum over timesteps, and this total reward sum is subsequently decomposed into rewards accumulated before the current timestep and rewards from the current timestep onward; the chain of manipulations is sketched below. The final decomposed expression makes the distinction between past and future rewards explicit, setting the stage for eliminating the irrelevant past rewards from the gradient calculation.
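A sketch of this derivation in one standard notation (a policy $\pi_\theta(a_t \mid s_t)$, per-step rewards $r_k$, a trajectory of length $T$, and a constant baseline $b$; the expectation over trajectories is omitted, and the exact symbols on the original card may differ):

$$
\begin{aligned}
\nabla_\theta J(\theta)
&\approx \Bigg(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg)\Bigg(\sum_{k=1}^{T} r_k - b\Bigg) \\
&= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg(\sum_{k=1}^{T} r_k - b\Bigg) \\
&= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg(\underbrace{\sum_{k=1}^{t-1} r_k}_{\text{past rewards}} \;+\; \underbrace{\sum_{k=t}^{T} r_k}_{\text{future rewards}} \;-\; b\Bigg)
\end{aligned}
$$

The first line is the policy gradient with a baseline for a single trajectory, the second distributes the bracketed reward term into the sum over timesteps, and the third splits the total reward at timestep $t$; only the future-reward sum depends on the action taken at $t$, which is what the causality argument exploits.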

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
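A short worked computation with the numbers given above (the signs, not the exact magnitudes, are the point):

$$
R_A - b = 20 - 12 = +8, \qquad R_B - b = 5 - 12 = -7
$$

Because the baseline-adjusted return is positive for Trajectory A and negative for Trajectory B, the update pushes the policy toward the actions taken in A and away from those taken in B, rather than reinforcing both as the raw returns (+20 and +5, both positive) would.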
Consider the formula for the policy gradient estimate with a baseline (a standard form is sketched below, after the next related item). According to this formula, the baseline value $b$ is subtracted from the reward $r_t$ at each individual timestep $t$ within a trajectory to reduce variance.
Stabilizing Policy Gradient Training
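The formula referenced in the question above is not reproduced on this page; one commonly written form of the policy gradient estimate with a baseline (a standard convention, not necessarily the exact formula from the original card) is:

$$
\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b\big),
\qquad R(\tau) = \sum_{k=1}^{T} r_k
$$

In this form, $b$ enters once per trajectory rather than once per reward $r_t$.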
Learn After
Irrelevance of Past Rewards for Policy Gradient Calculation
Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.
Derivation: Let the policy gradient objective be:
Step 1: The reward term, which is constant with respect to the parameters $\theta$, is moved inside the derivative:
Step 2: The reward term, which is constant for a given trajectory $\tau$, is distributed inside the summation over timesteps $t$:
Step 3: The total reward sum $\sum_{k=1}^{T} r_k$ is decomposed into rewards before and after the current timestep $t$:
Which step introduces an error into the derivation?
Purpose of Reward Decomposition in Policy Gradient
A common technique to improve the stability of a policy-based learning algorithm involves rewriting its core update rule. The goal is to isolate the influence of rewards that occur after an action is taken from those that occur before. Below are four key stages of this mathematical derivation. Arrange them in the correct logical order, from the initial formulation to the final decomposed form. (Note: For simplicity, the expectation over trajectories is omitted).