Learn Before
Impact of Reward Scale Variation on Policy Gradient Variance
A significant reason for the high variance in policy gradient methods is that rewards can fluctuate drastically across different steps. For example, if a reward model provides small positive rewards for good actions (such as +1) but imposes massive penalties for poor actions (such as -100), the overall sequence might yield a very low total reward even if it contains many good actions. This disparity obscures the contribution of individual good actions.
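To make this concrete, below is a minimal numpy sketch (illustrative code, not from the course) that estimates the variance of total returns under a skewed reward scale versus a more balanced one. Since the total return multiplies every gradient term in vanilla REINFORCE, its variance directly drives gradient variance. The 10% bad-action rate and the 10-step trajectory length are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_returns(good_r, bad_r, p_bad=0.1, steps=10, n=10_000):
    """Sample total returns for n trajectories in which each step is
    'bad' with probability p_bad and 'good' otherwise."""
    bad = rng.random((n, steps)) < p_bad
    rewards = np.where(bad, bad_r, good_r)
    return rewards.sum(axis=1)

# Same task, two reward scales: tiny bonus / huge penalty vs. balanced.
skewed = sample_returns(good_r=+1.0, bad_r=-100.0)
balanced = sample_returns(good_r=+1.0, bad_r=-2.0)

# The return scales every grad-log-prob term in REINFORCE, so a large
# spread in returns translates directly into noisy gradient estimates.
print(f"skewed   mean={skewed.mean():8.2f}  var={skewed.var():10.2f}")
print(f"balanced mean={balanced.mean():8.2f}  var={balanced.var():10.2f}")
```

Running this, the skewed scheme's return variance is several orders of magnitude larger than the balanced scheme's, even though both trajectories contain mostly good actions.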
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Baseline Method for Policy Gradient Variance Reduction
Total Reward (Return)
An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:
- Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
- Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.
Which of the following statements best analyzes the impact of these reward distributions on the policy update?
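As an illustrative sketch of the comparison above (assumed numpy code, not part of the question), the snippet below contrasts the per-step gradient weights under a whole-trajectory return, which cannot tell Trajectories A and B apart, with reward-to-go weights, which can:

```python
import numpy as np

traj_a = np.array([1.0] * 10)            # +1 at every step
traj_b = np.array([0.0] * 9 + [10.0])    # all reward at the final step

for name, rewards in [("A", traj_a), ("B", traj_b)]:
    total = rewards.sum()
    # Vanilla REINFORCE: every log-prob term is scaled by the same
    # scalar return, so A and B yield identical per-step weights.
    whole_traj_weights = np.full_like(rewards, total)
    # Reward-to-go: step t is credited only with rewards from t onward.
    # A's credit shrinks over time; B's steps all precede the final +10,
    # so the two trajectories now produce different weight patterns.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    print(name, "total:", total)
    print("  whole-trajectory weights:", whole_traj_weights)
    print("  reward-to-go weights:    ", reward_to_go)
```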
Diagnosing Unstable Reinforcement Learning Training
True or False: In a basic policy gradient method, if an agent completes a trajectory with a high positive total reward, the learning algorithm will reinforce every action taken during that trajectory, even those that were suboptimal or did not directly contribute to the final outcome.
Impact of Reward Scale Variation on Policy Gradient Variance
Learn After
Analyzing Training Instability from Reward Design
An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
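One rough way to compare the two schemes numerically is sketched below (hedged illustration: the outcome probabilities of 60% polite, 35% helpful, and 5% rude are assumed, not given in the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample response outcomes: 0 = polite, 1 = helpful, 2 = rude.
outcomes = rng.choice(3, size=100_000, p=[0.60, 0.35, 0.05])

scheme_a = np.array([+1.0, +2.0, -100.0])[outcomes]
scheme_b = np.array([+5.0, +10.0, -15.0])[outcomes]

# A rare -100 penalty dominates Scheme A's spread; Scheme B keeps all
# rewards on a comparable scale, giving a steadier gradient signal.
print(f"Scheme A: mean={scheme_a.mean():6.2f}  std={scheme_a.std():6.2f}")
print(f"Scheme B: mean={scheme_b.mean():6.2f}  std={scheme_b.std():6.2f}")
```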
Critiquing a Reward Function for Maze Navigation