1Cademy - Baseline's Role in Centering Rewards and Reducing Gradient Variance

Learn Before

Baseline Method for Policy Gradient Variance Reduction

Concept

Baseline's Role in Centering Rewards and Reducing Gradient Variance

Subtracting a baseline $b$ from the total reward $\sum_{t=1}^{T} r_t$ centers the reward values. For instance, if the baseline is the expected total reward, the centered rewards have mean zero. This centering is the direct mechanism for variance reduction: the policy gradient is estimated from the product term $(\sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t))(\sum_{t=1}^{T} r_t - b)$, and keeping the reward factor close to zero keeps the value of this product from swinging widely between sampled episodes. Provided $b$ does not depend on the sampled actions, the subtraction leaves the expected value of the gradient estimate unchanged, so only its variance is affected.
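The effect can be checked numerically. The following is a minimal sketch, not taken from the source: it assumes a hypothetical one-step task ($T = 1$) with a three-action softmax policy and a made-up reward model in which every action pays about 10. The names `softmax`, `expected_reward`, and `gradient_samples` are illustrative. It compares per-episode gradient samples with $b = 0$ against $b$ set to the expected total reward.

```python
import numpy as np

def softmax(theta):
    """Action probabilities of a softmax policy with logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta):
    """Expected total reward under the assumed reward model below."""
    mean_reward_per_action = 10.0 + 0.5 * np.arange(len(theta))
    return float(np.dot(softmax(theta), mean_reward_per_action))

def gradient_samples(theta, baseline, n_episodes=200_000, seed=0):
    """Per-episode samples of grad log pi(a) * (R - b) for one-step episodes."""
    rng = np.random.default_rng(seed)
    probs = softmax(theta)
    actions = rng.choice(len(theta), size=n_episodes, p=probs)
    # Assumed reward model: large common offset, small action bonus, unit noise.
    rewards = 10.0 + 0.5 * actions + rng.normal(0.0, 1.0, size=n_episodes)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - probs.
    score = np.eye(len(theta))[actions] - probs
    return score * (rewards - baseline)[:, None]

theta = np.zeros(3)
g_raw = gradient_samples(theta, baseline=0.0)
g_centered = gradient_samples(theta, baseline=expected_reward(theta))

print("mean gradient, b = 0:    ", g_raw.mean(axis=0))
print("mean gradient, b = E[R]: ", g_centered.mean(axis=0))
print("total variance, b = 0:   ", g_raw.var(axis=0).sum())
print("total variance, b = E[R]:", g_centered.var(axis=0).sum())
```

Under these assumptions the two mean gradients should agree up to sampling noise, while the summed per-component variance should be much smaller with the centered rewards, because the large common offset in the rewards no longer multiplies every score term.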

Updated 2025-10-06

References

Reference of Foundations of Large Language Models Course

Learn After

In a policy gradient algorithm, the update for the policy parameters is influenced by the term (R - b), where R is the total reward for an episode and b is a baseline. Imagine you are training an agent where most episodes yield a small, positive total reward (e.g., between 1 and 5). If you set the baseline b to a constant, large positive value (e.g., 10), what is the most likely consequence for the learning process?
Diagnosing Training Instability
True or false: In policy gradient methods, subtracting a baseline from the total reward is a technique used to reduce gradient variance. A key property of a properly chosen baseline is that it alters the expected value of the policy gradient, making the updates more conservative.
