Policy Gradient Estimate with Baseline
The policy gradient estimate can be refined by incorporating a baseline, b, to reduce variance. The baseline is subtracted from the total reward of each trajectory. Estimated from a dataset of trajectories $\mathcal{D}$, the policy gradient with a baseline is given by:

$$\nabla_\theta J(\theta) \;\approx\; \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \big( R(\tau) - b \big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where $R(\tau) = \sum_t r_t$ is the total reward of trajectory $\tau$. This adjustment helps stabilize learning by reducing the variance of the gradient estimate, making updates less sensitive to extreme fluctuations in individual rewards.
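As a minimal sketch of this estimate, the snippet below averages the baseline-adjusted score-function terms over a dataset. The function and field names (`policy_gradient_with_baseline`, `total_reward`, `grad_log_prob`) are illustrative, not from the source, and the per-trajectory gradients of the log-probabilities are assumed to be precomputed:

```python
import numpy as np

def policy_gradient_with_baseline(trajectories, baseline):
    """Average (R(tau) - b) * grad_theta log pi_theta(tau) over a dataset.

    `trajectories` is a list of dicts, each with:
      - "total_reward": R(tau), the trajectory's total reward
      - "grad_log_prob": sum_t grad_theta log pi_theta(a_t | s_t),
        precomputed as a flat NumPy vector of parameter size
    """
    return np.mean(
        [(t["total_reward"] - baseline) * t["grad_log_prob"] for t in trajectories],
        axis=0,
    )

# Example: two toy trajectories for a 3-parameter policy, baseline b = 12.
data = [
    {"total_reward": 20.0, "grad_log_prob": np.array([0.1, -0.2, 0.3])},
    {"total_reward": 5.0,  "grad_log_prob": np.array([-0.4, 0.1, 0.2])},
]
print(policy_gradient_with_baseline(data, baseline=12.0))
```

Note that the baseline scales each trajectory's contribution by how much better or worse it was than b, rather than by its raw reward.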

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Baseline's Role in Centering Rewards and Reducing Gradient Variance
State-Value Function as a Baseline
Baseline's Impact on Reward Variance vs. Gradient Estimate Variance
An engineer is training two reinforcement learning agents (Agent A and Agent B) on the same task using a policy gradient method. The environment has a wide range of possible total rewards, from highly negative to highly positive. Agent A's learning algorithm directly uses the total reward received after each episode to update its policy. Agent B's algorithm first subtracts a constant value (equal to the average total reward observed so far) from the total reward before using it for the update. What is the most likely difference in the training process between Agent A and Agent B?
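A quick simulation sketch illustrates the difference between the two agents. The toy model below is an assumption for illustration: the ±1 score term stands in for the gradient of the log-probability, and the reward distribution is made deliberately wide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model of one policy-gradient update per episode: a score-function
# term (stand-in for grad log pi, here just +/-1) scaled by the reward.
n_episodes = 10_000
scores = rng.choice([-1.0, 1.0], size=n_episodes)
rewards = rng.normal(loc=50.0, scale=40.0, size=n_episodes)  # wide reward range

# Agent A scales by the raw total reward; Agent B first subtracts the
# average reward (a constant baseline).
updates_a = rewards * scores
updates_b = (rewards - rewards.mean()) * scores

print(f"Agent A update variance: {updates_a.var():,.0f}")  # ~ 50^2 + 40^2
print(f"Agent B update variance: {updates_b.var():,.0f}")  # ~ 40^2
```

Under these assumed distributions, Agent A's update variance is roughly E[R²] = 50² + 40², while Agent B's is only Var(R) = 40², so Agent B's training should be noticeably smoother.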
Benefit of a Baseline in a Positive-Reward Environment
A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episode's total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?
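A back-of-the-envelope variance check (a sketch, assuming rewards uniform on [95, 105] and an independent score term $s$ with $\mathbb{E}[s] = 0$ and $\mathbb{E}[s^2] = 1$) suggests the baseline is still highly useful here:

$$\operatorname{Var}\!\big(R\,s\big) = \mathbb{E}[R^2] = \operatorname{Var}(R) + (\mathbb{E}[R])^2 \approx 8.3 + 100^2 \approx 10{,}008, \qquad \operatorname{Var}\!\big((R - 100)\,s\big) = \operatorname{Var}(R) \approx 8.3$$

Even though the rewards are tightly clustered, they are clustered far from zero, so without the baseline every update carries a large common coefficient of roughly 100; subtracting it cuts the estimator's variance by about three orders of magnitude.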
Learn After
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
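As a quick worked computation with the per-trajectory form above (the numbers come from the question itself):

$$R(\tau_A) - b = 20 - 12 = +8, \qquad R(\tau_B) - b = 5 - 12 = -7$$

So trajectory A's actions are made more likely (positive weight +8), while trajectory B's actions, despite earning a positive raw reward, are made less likely (negative weight -7) because B underperformed the baseline.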
Consider the formula for the policy gradient estimate with a baseline:

$$\nabla_\theta J(\theta) \;\approx\; \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \big( R(\tau) - b \big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

According to this formula, the baseline value b is subtracted from the reward r_t at each individual timestep t within a trajectory to reduce variance.
Stabilizing Policy Gradient Training