Policy Gradient with Reward-to-Go and Baseline
By omitting the term for past rewards, which do not contribute to the gradient, we arrive at a simplified version of the policy gradient with a baseline. This formulation isolates the future rewards (the reward-to-go) from time step t onwards:

∇_θ J(θ) ≈ ∑_{t=1}^{T} ∇_θ log π_θ(a_t|s_t) * ( (∑_{k=t}^{T} r_k) - b(s_t) )

Removing the past rewards not only simplifies the calculation but can further reduce the variance of the gradient estimate.
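As a concrete illustration, below is a minimal Python sketch of the per-timestep weights (∑_{k=t}^{T} r_k) - b(s_t) that multiply ∇_θ log π_θ(a_t|s_t). The function name reward_to_go_weights is illustrative, and a constant baseline is assumed for simplicity.

import numpy as np

def reward_to_go_weights(rewards, baseline=0.0):
    """Per-timestep policy gradient weights: (sum_{k=t}^{T} r_k) - baseline.

    rewards  : the sequence r_1 ... r_T from a single trajectory
    baseline : stands in for b(s_t); a constant here for simplicity
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse cumulative sum gives the reward-to-go at every timestep t.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - baseline

# The single-trajectory gradient estimate is then
#   sum_t weights[t] * grad_theta log pi_theta(a_t | s_t).
weights = reward_to_go_weights([-1.0, -1.0, -1.0, 10.0], baseline=5.0)
print(weights)  # [2. 3. 4. 5.]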

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory, ∑_{k=1}^{T} r_k. Including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduces a systematic error, or bias, into the resulting gradient estimate.

Policy Gradient with Reward-to-Go and Baseline
Calculating Advantage from a Trajectory
In the context of estimating the advantage of taking an action a_t in a state s_t, the formula A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is often used. What is the primary role of the reward-to-go term, ∑_{k=t}^{T} r_k, within this specific estimation?

In a given trajectory, if the calculated advantage A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is negative, it implies that the action a_t taken in state s_t led to a sequence of rewards that was worse than the average expected outcome from that state.

Policy Gradient with Reward-to-Go and Baseline
In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

∇_θ log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) - b )

Here, t is a specific timestep within the trajectory of length T, π_θ(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t (∇_θ log π_θ(a_t|s_t)) and the two components of the reward sum?

In the context of improving a policy gradient estimator, the total reward for a trajectory, ∑_{k=1}^{T} r_k, is often rewritten inside the gradient calculation for a specific timestep t as ∑_{k=1}^{t-1} r_k + ∑_{k=t}^{T} r_k. This specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
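The advantage formula referenced in the related questions above, A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t), can be sketched in the same style. The function advantage_estimates and the value estimates passed to it are hypothetical stand-ins, not part of the course material.

import numpy as np

def advantage_estimates(rewards, state_values):
    """A(s_t, a_t) = (sum_{k=t}^{T} r_k) - V(s_t) for every timestep.

    rewards      : r_1 ... r_T from one trajectory
    state_values : the corresponding estimates V(s_1) ... V(s_T)
    """
    rewards = np.asarray(rewards, dtype=float)
    state_values = np.asarray(state_values, dtype=float)
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - state_values

# A negative entry means the observed return from that state was worse than
# the value estimate V(s_t), so the corresponding action is made less likely.
print(advantage_estimates([1.0, -2.0, 0.5], [0.0, -2.0, 0.0]))  # [-0.5  0.5  0.5]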
Learn After
Analysis of Policy Update Mechanisms
An agent completes an episode with the following sequence of rewards:
r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t = 2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t = 2?

Stabilizing Policy Gradient Training
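Worked out under the reward-to-go-with-baseline rule described at the top of this page, the multiplier for the action at t = 2 is the reward-to-go from t = 2 onwards minus the baseline: (r_2 + r_3 + r_4) - b(s_2) = (-1 - 1 + 10) - 5 = 3.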