1Cademy - Policy Gradient with Baseline

Learn Before

Policy Gradient Estimate under Uniform Trajectory Probability

Formula

Policy Gradient with Baseline

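To reduce the variance of the policy gradient estimator, a baseline term $b$ can be subtracted from the total trajectory reward $R(\tau) = \sum_{t=1}^{T} r_t$. This does not bias the gradient estimate as long as the baseline does not depend on the action $a_t$: since $\mathbb{E}_{a_t \sim \pi_{\theta}(\cdot|s_t)}\left[ \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t|s_t) \right] = 0$, the term multiplied by $b$ vanishes in expectation. The resulting policy gradient estimate over a set $D$ of sampled trajectories is:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|D|} \sum_{\tau \in D} \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \right) \left( \sum_{t=1}^{T} r_t - b \right)$$

A common choice for the baseline is an estimate of the state-value function, $V(s_t)$. In that case the baseline varies with the time step, so it is applied per step inside the sum over $t$ rather than as a single constant $b$.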
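The formula can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: it assumes a tabular softmax policy with one row of logits per state, and the names `grad_log_prob` and `policy_gradient_with_baseline`, along with the toy data set, are hypothetical choices made for this example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D vector of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(theta, s, a):
    # Gradient of log pi_theta(a|s) w.r.t. theta for a tabular softmax
    # policy (an assumption of this sketch): one_hot(a) - pi(.|s) in row s.
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def policy_gradient_with_baseline(theta, trajectories, b=0.0):
    # Each trajectory is a (states, actions, rewards) triple.
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        # Sum over t of the gradient of log pi_theta(a_t|s_t).
        score = sum(grad_log_prob(theta, s, a) for s, a in zip(states, actions))
        # Weight by the total trajectory reward minus the constant baseline.
        grad += score * (np.sum(rewards) - b)
    return grad / len(trajectories)

# Toy data set: one state, two actions, two single-step trajectories.
theta = np.zeros((1, 2))
D = [([0], [0], [1.0]), ([0], [1], [3.0])]
print(policy_gradient_with_baseline(theta, D, b=0.0))
print(policy_gradient_with_baseline(theta, D, b=2.0))  # b = mean return
```

In this toy data set both calls print the same gradient, `[[-0.5  0.5]]`, because the two score terms cancel exactly. In general the baseline leaves the expectation of the estimate unchanged while changing the variance of the individual per-trajectory terms.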

Updated 2025-10-08

References

Reference of Foundations of Large Language Models Course
