Formula

Policy Gradient Estimate with Baseline

The policy gradient estimate can be refined by incorporating a baseline $b$ to reduce variance. The baseline is subtracted from the total reward of each trajectory. Estimated from a dataset $\mathcal{D}$ of trajectories, the policy gradient with a baseline is given by:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r_t - b \right)$$

This adjustment stabilizes learning by reducing the variance of the gradient estimate, making updates less sensitive to extreme fluctuations in individual trajectory rewards.
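As a concrete illustration, here is a minimal sketch of this estimator written as a surrogate loss in PyTorch, whose gradient matches the formula above (up to sign). The function name and its arguments (`log_probs_per_traj`, `rewards_per_traj`, a scalar `baseline`) are hypothetical, not taken from the text:

```python
import torch

def pg_loss_with_baseline(log_probs_per_traj, rewards_per_traj, baseline):
    """Surrogate loss for the policy gradient estimate with baseline b.

    log_probs_per_traj: list of 1-D tensors; entry tau holds
        log pi_theta(a_t | s_t) for t = 1..T, attached to the policy's
        autograd graph.
    rewards_per_traj: list of 1-D tensors of rewards r_t (no gradient needed).
    baseline: scalar b subtracted from each trajectory's total reward.
    """
    loss = torch.tensor(0.0)
    for log_probs, rewards in zip(log_probs_per_traj, rewards_per_traj):
        advantage = rewards.sum() - baseline       # sum_t r_t - b
        loss = loss - log_probs.sum() * advantage  # minus sign: optimizers minimize
    return loss / len(log_probs_per_traj)          # average over |D| trajectories
```

Calling `loss.backward()` on the returned value fills each policy parameter's `.grad` with the negative of the estimate above, so a standard gradient-descent step on this loss performs gradient ascent on $J(\theta)$. One common (though here assumed) choice for $b$ is the mean total reward across the trajectories in $\mathcal{D}$.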

