Concept

Irrelevance of Past Rewards for Policy Gradient Calculation

In Markov decision processes, the future is independent of the past given the present. Consequently, an action taken at time step $t$ cannot influence the rewards received before $t$. Since the rewards prior to $t$ are already fixed by the time the action is chosen, the sum of past rewards, $\sum_{k=1}^{t-1} r_k$, contributes nothing to the gradient and can be dropped: each action's log-probability is weighted only by the reward-to-go, $\sum_{k=t}^{T} r_k$. Eliminating the past-reward term further reduces the variance of the policy gradient estimator.
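The idea can be sketched numerically. The helper below (a hypothetical name, not from the original text) computes the reward-to-go $\sum_{k=t}^{T} r_k$ for each step of a trajectory; these are the weights that replace the full episode return in the policy gradient sum once past rewards are dropped.

```python
import numpy as np

def reward_to_go(rewards):
    """For each step t, return the sum of rewards from t to the end
    of the trajectory: G_t = r_t + r_{t+1} + ... + r_T.
    Rewards before t are excluded, since the action at t cannot
    affect them."""
    g = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each entry sums only future rewards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        g[t] = running
    return g

rewards = [1.0, 0.0, 2.0, 3.0]
print(reward_to_go(rewards))  # [6. 5. 5. 3.]
```

Note that the first entry equals the full episode return, while later entries shrink as past rewards are excluded; in a REINFORCE-style update, step $t$'s log-probability gradient would be scaled by `g[t]` rather than by the total return.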


Updated 2026-05-01


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
