Baseline Method for Policy Gradient Variance Reduction
A straightforward technique to lower the variance of the policy gradient is to introduce a baseline, denoted b. This baseline acts as a reference point and is subtracted from the total reward, replacing the scaling term R(τ) with R(τ) − b. Centering the rewards around the baseline (e.g., if b is defined as the expected value of the total reward, the centered terms average to zero) removes the constant offset shared by all trajectories, so each update reflects how much better or worse a trajectory is than typical rather than its raw reward. This makes the learning updates more stable and less sensitive to extreme fluctuations in individual rewards, and because b does not depend on the actions taken, it does so without introducing bias into the gradient estimate.
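As a concrete illustration, here is a minimal NumPy sketch of a REINFORCE-style update with a batch-average baseline. The function name, the data layout (each trajectory as a list of (grad_log_prob, reward) pairs), and the choice of the batch mean as b are illustrative assumptions, not details from the text above.

```python
import numpy as np

def reinforce_update(trajectories, theta, learning_rate=0.01, use_baseline=True):
    """One REINFORCE-style update.

    Each trajectory is a list of (grad_log_prob, reward) pairs, where
    grad_log_prob is an array with the same shape as theta (a float array).
    """
    # Total reward R(tau) of each trajectory.
    returns = np.array([sum(r for _, r in traj) for traj in trajectories])

    # Baseline b: here, the average return over the batch (one common choice).
    b = returns.mean() if use_baseline else 0.0

    grad = np.zeros_like(theta)
    for traj, ret in zip(trajectories, returns):
        for grad_log_prob, _ in traj:
            # Every log-prob gradient in the trajectory is scaled by (R(tau) - b).
            grad += (ret - b) * grad_log_prob
    grad /= len(trajectories)

    # Gradient ascent on the expected return.
    return theta + learning_rate * grad
```

Because b does not depend on the sampled actions, subtracting it leaves the expected gradient unchanged; it only shrinks the magnitude of the per-trajectory scaling factors, which is where the variance reduction comes from.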
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Total Reward (Return)
An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:
- Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
- Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.
Which of the following statements best analyzes the impact of these reward distributions on the policy update?
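For intuition, the two trajectories can be compared with a few lines of plain Python (no RL library assumed; the variable names are illustrative):

```python
traj_a = [1] * 10        # +1 at each of 10 steps
traj_b = [0] * 9 + [10]  # 0 for nine steps, then +10 at the end

return_a = sum(traj_a)   # 10
return_b = sum(traj_b)   # 10

# In a trajectory-level policy gradient, every action in a trajectory is
# weighted by the same total return, so both trajectories scale their
# log-probability gradients by the same factor of 10, regardless of how
# the reward was distributed across steps.
print(return_a, return_b)  # -> 10 10
```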
Diagnosing Unstable Reinforcement Learning Training
True or False: In a basic policy gradient method, if an agent completes a trajectory with a high positive total reward, the learning algorithm will reinforce every action taken during that trajectory, even those that were suboptimal or did not directly contribute to the final outcome.
Impact of Reward Scale Variation on Policy Gradient Variance
Baseline Method for Policy Gradient Variance Reduction
An agent is being trained in an environment where its sole objective is to maximize the sum of rewards it collects during an episode. The agent completes two separate episodes, receiving the following sequences of rewards:
- Episode A: [+2, +2, +2, +2, +2]
- Episode B: [-5, -5, +10, +10, +1]
Based on the agent's primary objective, which statement correctly compares the outcomes of these two episodes?
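Since the stated objective is simply the sum of rewards collected per episode (no discounting is mentioned), the comparison reduces to two sums:

```python
episode_a = [+2, +2, +2, +2, +2]
episode_b = [-5, -5, +10, +10, +1]

print(sum(episode_a))  # -> 10
print(sum(episode_b))  # -> 11
```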
Robot Navigation Path Selection
Calculating Episode Return
Learn After
Policy Gradient Estimate with Baseline
Baseline's Role in Centering Rewards and Reducing Gradient Variance
State-Value Function as a Baseline
Baseline's Impact on Reward Variance vs. Gradient Estimate Variance
An engineer is training two reinforcement learning agents (Agent A and Agent B) on the same task using a policy gradient method. The environment has a wide range of possible total rewards, from highly negative to highly positive. Agent A's learning algorithm directly uses the total reward received after each episode to update its policy. Agent B's algorithm first subtracts a constant value (equal to the average total reward observed so far) from the total reward before using it for the update. What is the most likely difference in the training process between Agent A and Agent B?
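A small NumPy sketch of the difference, using made-up episode returns that span a wide range; the specific distribution and the running-mean baseline are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical episode returns spanning a wide range, negative to strongly positive.
returns = rng.uniform(-50.0, 150.0, size=500)

# Agent A: scales each policy update by the raw return.
weights_a = returns

# Agent B: first subtracts the average return observed so far (a running mean).
running_mean = np.cumsum(returns) / np.arange(1, len(returns) + 1)
weights_b = returns - running_mean

# The centered weights are smaller in squared magnitude on average,
# which translates into smaller, steadier policy updates.
print(np.mean(weights_a**2))  # large: raw returns carry a big common offset
print(np.mean(weights_b**2))  # noticeably smaller after centering
```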
Benefit of a Baseline in a Positive-Reward Environment
A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episode's total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?
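A tiny numeric illustration (the return values below are made up, confined to the 95-105 band described above):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.uniform(95.0, 105.0, size=8)   # hypothetical episode returns
baseline = 100.0                             # long-run average return

centered = returns - baseline

print(np.round(returns, 1))   # all close to 100: every episode looks "good"
print(np.round(centered, 1))  # roughly -5..+5: above-average episodes get a
                              # positive weight, below-average ones a negative one
```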