Reward-to-Go
The reward-to-go, often denoted G_t, represents the cumulative reward from a specific time step t until the end of an episode. It is calculated as:

G_t = r_t + r_{t+1} + ... + r_T = Σ_{t'=t}^{T} r_{t'}

In policy gradient methods, using the reward-to-go to weight an action's log-probability is a key variance-reduction technique. It improves upon using the total trajectory reward by ensuring that an action's update is influenced only by subsequent rewards, which respects causality and provides more accurate credit assignment.
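A minimal sketch of this computation in plain Python (the function name and the example episode are illustrative; a single backward pass computes every G_t in O(T) via G_t = r_t + G_{t+1}):

```python
def rewards_to_go(rewards):
    """Compute G_t = r_t + r_{t+1} + ... + r_T for every time step t."""
    G = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# Illustrative episode: prints [6.0, 11.0, 9.0, 4.0]
print(rewards_to_go([-5.0, 2.0, 5.0, 4.0]))
```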

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch of |D| sampled trajectories using the following formula:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows for the transition from the theoretical expectation to this practical sample mean estimator?
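For context, the practical formula is an ordinary Monte Carlo (sample-mean) estimate. A minimal sketch, assuming each trajectory in the batch exposes a hypothetical gradient function grad_log_prob(τ) and a scalar return R(τ) (both names are placeholders, not part of the question):

```python
import numpy as np

def estimate_policy_gradient(batch, grad_log_prob, R):
    """Sample-mean estimate of E_{τ~π_θ}[∇_θ log Pr_θ(τ) · R(τ)].

    The average only stands in for the expectation if the trajectories
    in `batch` are drawn i.i.d. from the current policy π_θ.
    """
    return np.mean([grad_log_prob(tau) * R(tau) for tau in batch], axis=0)
```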
Policy Gradient with Baseline
Reward-to-Go
An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.
- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.
The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively. Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?
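Substituting these values into the estimator above gives ∇J(θ) ≈ (1/2)[10 · ∇_θ log Pr_θ(τ_1) + (-5) · ∇_θ log Pr_θ(τ_2)]. A numeric sketch with made-up two-dimensional gradient vectors, chosen only to make the arithmetic concrete:

```python
import numpy as np

# Hypothetical gradient vectors, for illustration only.
grad_log_p_tau1 = np.array([0.3, -0.1])
grad_log_p_tau2 = np.array([0.2, 0.4])

# ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} ∇_θ log Pr_θ(τ) R(τ), with |D| = 2.
grad_J = 0.5 * (10 * grad_log_p_tau1 + (-5) * grad_log_p_tau2)
print(grad_J)  # [ 1.  -1.5]
```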
Learn After
An agent completes a task, which consists of a sequence of states, actions, and rewards (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_T, a_T, r_T). To improve the agent's performance, we need to adjust the likelihood of taking each action a_t at state s_t. Consider two different ways to calculate the 'quality score' used to update the action a_t:
- Method 1: The score is the sum of all rewards in the sequence: r_1 + r_2 + ... + r_T.
- Method 2: The score is the sum of rewards from time step t onward: r_t + r_{t+1} + ... + r_T.
Which of the following statements best explains why Method 2 is generally a more effective approach for training the agent than Method 1?
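A minimal sketch contrasting the two scoring methods on an illustrative reward sequence; note how Method 2 keeps rewards earned before step t from influencing the score of a_t:

```python
def scores_total_reward(rewards):
    """Method 1: every action gets the same score, r_1 + ... + r_T."""
    total = sum(rewards)
    return [total] * len(rewards)

def scores_reward_to_go(rewards):
    """Method 2: action a_t is scored by r_t + r_{t+1} + ... + r_T only."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

rewards = [1.0, -2.0, 3.0, 0.5]      # illustrative episode
print(scores_total_reward(rewards))  # [2.5, 2.5, 2.5, 2.5]
print(scores_reward_to_go(rewards))  # [2.5, 1.5, 3.5, 0.5]
```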
Analyzing Credit Assignment for a Policy Update
An agent completes an episode of 4 time steps, receiving the following sequence of rewards: r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1. When updating the agent's decision-making process, what is the 'reward-to-go' value that should be associated with the action taken at time step t = 2?
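A quick numeric check against the definition above (the reward-to-go at t = 2 sums r_2 through r_4):

```python
# Rewards from the episode above, indexed by time step.
r = {1: -10, 2: 2, 3: 5, 4: -1}

# G_2 = r_2 + r_3 + r_4
G_2 = sum(r[t] for t in range(2, 5))
print(G_2)  # 2 + 5 + (-1) = 6
```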