Learn Before
Rationale for Using a Baseline in Policy Gradients
In the context of training a policy, consider a scenario where all rewards for a set of actions are positive (e.g., ranging from +10 to +20). Without a baseline, the term weighting the gradient update is always positive. Explain how subtracting a baseline (e.g., the average reward of +15) from the total reward helps to create a more stable and effective learning signal, specifically addressing how it reduces the variance of the gradient estimate.
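A quick way to see the variance effect is a toy Monte Carlo sketch. The snippet below is a minimal illustration, not part of the original card: it assumes a simplified model (an assumption for this sketch) in which the score term ∇_θ log π_θ(a_t|s_t) is a zero-mean scalar drawn independently of the reward, with rewards uniform on [+10, +20] as in the prompt's scenario. Under those assumptions, subtracting a constant baseline leaves the expected gradient unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy model (assumption, not from the card): the score term grad log pi(a|s)
# is a zero-mean scalar independent of the reward, and rewards are uniform
# on [+10, +20] as in the prompt's scenario.
score = rng.normal(0.0, 1.0, size=n)
reward = rng.uniform(10.0, 20.0, size=n)
baseline = reward.mean()  # approximately the +15 average from the prompt

g_raw = score * reward               # per-sample gradient term, no baseline
g_adj = score * (reward - baseline)  # per-sample term, baseline subtracted

# Both estimators have (near-)zero mean: subtracting a constant baseline
# does not change the expected update direction...
print("means:    ", g_raw.mean(), g_adj.mean())
# ...but the baselined estimator has far lower variance.
print("variances:", g_raw.var(), g_adj.var())
```

In this toy model the raw estimator's variance is driven by E[R²] ≈ 233, while subtracting the mean leaves only Var(R) ≈ 8.3, a roughly 28× reduction; that gap is the intuition the card asks you to articulate.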
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Decomposition of Reward Sum for Causality in Policy Gradients
In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?

In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep t is an estimate of the value of the specific action a_t taken in state s_t. What is the primary theoretical problem with this choice of baseline?
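Both related questions turn on the same standard identity from the policy gradient derivation: a baseline that depends only on the state factors out of the expectation over actions, and the remaining score term sums to zero. A worked sketch of that identity:

```latex
% A state-dependent baseline b(s) adds zero bias: it factors out of the
% expectation over actions, and the score term sums to zero.
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\bigr]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

If the baseline instead depends on the sampled action a_t, it cannot be factored out of the sum over actions, so the subtracted term no longer vanishes in expectation and the gradient estimate becomes biased; that is the theoretical problem the second question points at. A valid choice, by contrast, is any function of the state alone, such as the average reward or a learned value estimate of s_t.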