Formula

Policy Gradient with Baseline

To reduce the variance of the policy gradient estimator, a baseline term $b$ can be subtracted from the total trajectory reward $R(\tau) = \sum_{t=1}^{T} r_t$. This modification does not introduce bias into the gradient estimate as long as the baseline does not depend on the action $a_t$. The resulting formula for the policy gradient is:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|D|} \sum_{\tau \in D} \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r_t - b \right)$$

A common choice for the baseline is an estimate of the state-value function, $V(s_t)$.
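
As a concrete illustration, here is a minimal sketch of this estimator in NumPy. It assumes a tabular softmax policy, trajectories given as lists of (state, action, reward) tuples, and a constant baseline $b$ (rather than a learned $V(s_t)$); the function names and data format are illustrative, not part of the original text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_gradient_with_baseline(theta, trajectories, baseline):
    """REINFORCE gradient estimate with a baseline subtracted from the return.

    theta: (num_states, num_actions) logits of a tabular softmax policy.
    trajectories: list of trajectories, each a list of (state, action, reward).
    baseline: scalar b subtracted from each trajectory's total return.
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:
        # Total trajectory reward R(tau) = sum_t r_t
        R = sum(r for _, _, r in traj)
        # d/d theta of sum_t log pi_theta(a_t | s_t) for a tabular softmax policy
        score = np.zeros_like(theta)
        for s, a, _ in traj:
            probs = softmax(theta[s])
            score[s] -= probs        # gradient of log-softmax: one-hot(a) - pi(.|s)
            score[s, a] += 1.0
        grad += score * (R - baseline)
    return grad / len(trajectories)  # average over the batch D

# Toy usage: 2 states, 2 actions, a batch of two short trajectories.
theta = np.zeros((2, 2))
batch = [[(0, 1, 1.0), (1, 0, 0.5)], [(0, 0, 0.0)]]
b = np.mean([sum(r for _, _, r in t) for t in batch])  # simple baseline: mean batch return
theta += 0.1 * policy_gradient_with_baseline(theta, batch, b)  # one gradient-ascent step
```

Using the batch-mean return as $b$ is the simplest variance-reduction choice; replacing it with a state-dependent estimate $V(s_t)$, as the text notes, is the more common practice.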

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences