Irrelevance of Past Rewards for Policy Gradient Calculation
In Markov decision processes, the future is independent of the past given the present. Consequently, an action taken at time step t cannot influence the rewards received before time t. Since the rewards prior to time t are already "fixed" by the time the action is chosen, the term representing the sum of these past rewards, ∑_{k=1}^{t-1} r_k, does not contribute to the gradient and can be omitted. Eliminating this term helps to further reduce the variance of the policy gradient.
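As a sketch of why the past-reward term can be dropped (written in the notation used in the cards below, with the expectation over trajectories left implicit):

```latex
% Per-timestep policy-gradient term, with the return split at time t:
\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=1}^{T} r_k
  = \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \right)
% The past rewards do not depend on a_t, while the score function has zero
% mean under the policy:
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big]
  = \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
  = \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
  = \nabla_\theta 1 = 0
% so the past-reward part vanishes in expectation; keeping it only adds
% noise to the sampled gradient estimate.
```

Dropping a term whose expectation is zero leaves the gradient estimator unbiased while removing one source of noise.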
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
An autonomous agent completes a task over four time steps. The sequence of actions and resulting rewards is as follows:
- Time t=1: Action
a_1-> Rewardr_1 = 0 - Time t=2: Action
a_2-> Rewardr_2 = 0 - Time t=3: Action
a_3-> Rewardr_3 = -1 - Time t=4: Action
a_4-> Rewardr_4 = +10
When evaluating the decision to take action
a_2at time t=2, which rewards should be considered as being potentially influenced by this specific action?- Time t=1: Action
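A small numeric sketch of this trajectory (the reward values come from the list above; the reward_to_go helper is just an illustrative name, and the convention ∑_{k=t}^{T} r_k matches the decomposition used elsewhere on this page):

```python
# Reward-to-go for the example trajectory: the action a_t can only
# influence rewards from time t onward, so each action is credited
# with the suffix sum of rewards starting at its own timestep.
rewards = [0, 0, -1, 10]                 # r_1 .. r_4 from the list above

def reward_to_go(rs):
    """Suffix sums: out[t] = r_{t+1} + ... + r_T for a 0-indexed input list."""
    out, running = [], 0
    for r in reversed(rs):
        running += r
        out.append(running)
    return out[::-1]

print(reward_to_go(rewards))             # [9, 9, 9, 10]
# The decision a_2 is weighted by r_2 + r_3 + r_4 = 0 + (-1) + 10 = 9,
# not by r_1, which was received before a_2 was chosen.
```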
Causality Principle in Policy Gradient Calculation
An agent is learning to play a video game. At time step t=5, the agent performs an action (e.g., jumping). According to the causality principle in this context, this specific action at t=5 can alter the reward that was already received at time step t=3.
Debugging a Policy Update Calculation
Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.
Derivation: Let the policy gradient objective be:
Step 1: The reward term, which is constant with respect to the parameters θ, is moved inside the derivative:
Step 2: The reward term, which is constant for a given trajectory τ, is distributed inside the summation over timesteps t:
Step 3: The total reward sum ∑_{k=1}^{T} r_k is decomposed into rewards before and after the current timestep t:
Which step introduces an error into the derivation?
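For reference, one standard, error-free form of this rewriting (a sketch in the page's notation, with the expectation over trajectories left implicit; it does not necessarily reproduce the exact expressions used in the steps above):

```latex
% Objective gradient: rewards are constant w.r.t. \theta, so they can sit
% inside or outside the derivative of the log-probabilities:
\nabla_\theta J(\theta)
  = \nabla_\theta \Big( \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) \Big)
    \sum_{k=1}^{T} r_k
% Distribute the (trajectory-constant) return across the sum over t:
  = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \sum_{k=1}^{T} r_k
% Split the return at timestep t into past and future parts:
  = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big)
```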
Purpose of Reward Decomposition in Policy Gradient
A common technique to improve the stability of a policy-based learning algorithm involves rewriting its core update rule. The goal is to isolate the influence of rewards that occur after an action is taken from those that occur before. Below are four key stages of this mathematical derivation. Arrange them in the correct logical order, from the initial formulation to the final decomposed form. (Note: For simplicity, the expectation over trajectories is omitted).
Learn After
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy
π_θ(a_t|s_t)depends on a gradient calculation. For a specific timesteptwithin a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before timetfrom those that occurred at or after timet:log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )When computing the gradient of this expression with respect to the policy parameters
θ, how does the∑_{k=1}^{t-1} r_kterm (the sum of past rewards) influence the gradient associated with the action at timet?Policy Gradient Optimization
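As a quick check, differentiating the displayed term (a sketch; for a fixed trajectory both reward sums are constants with respect to θ):

```latex
% Both reward sums are fixed numbers once the trajectory is given, so they
% simply scale the score function:
\nabla_\theta \Big[ \log \pi_\theta(a_t \mid s_t)
  \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big) \Big]
  = \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big)
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)
% The past-reward sum enters only as a multiplicative constant whose
% expected contribution is zero (see the sketch near the top of the page).
```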
Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). Including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduces a systematic error, or bias, into the resulting gradient estimate.
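A minimal Monte Carlo sketch for checking this claim empirically. The setup is hypothetical and not from the card: a one-parameter Bernoulli policy, an action-independent "past" reward r1, and an action-dependent reward r2. The two estimators below weight the score function by the full return and by the reward-to-go, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))         # probability of choosing action 1

n = 200_000
a = rng.random(n) < p                    # sampled actions (boolean array)
r1 = rng.normal(3.0, 5.0, size=n)        # "past" reward, independent of the action
r2 = a.astype(float)                     # reward caused by the action (1 if a = 1)

# Score function d/dtheta log pi_theta(a): (1 - p) if a = 1, else -p.
score = np.where(a, 1.0 - p, -p)

g_full = score * (r1 + r2)               # weighted by the full return
g_togo = score * r2                      # weighted by the reward-to-go only

true_grad = p * (1.0 - p)                # analytic d/dtheta E[r2] for this policy
print("analytic gradient        :", true_grad)
print("full return    mean, var :", g_full.mean(), g_full.var())
print("reward-to-go   mean, var :", g_togo.mean(), g_togo.var())
```

Comparing the printed means against the analytic gradient p(1-p) shows whether the extra past-reward term shifts the estimate on average or only changes its spread.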