Formula

Policy Gradient with Reward-to-Go and Baseline

By omitting the terms for past rewards, which do not contribute to the gradient, we arrive at a simplified version of the policy gradient with a baseline. This formulation isolates the future rewards (the reward-to-go) from time step t onwards:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \left( \sum_{k=t}^{T} r_k - b \right) \right]$$

Removing the past rewards not only simplifies the calculation but can further reduce the variance of the gradient estimate.

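A minimal sketch of how this estimator might be turned into a training loss in PyTorch is shown below; calling backpropagation on the returned loss then yields the gradient above (with a flipped sign, since optimizers minimize). The helper names `reward_to_go` and `policy_gradient_loss`, and the `(log_probs, rewards)` trajectory layout, are illustrative assumptions rather than part of the original text.

```python
import torch


def reward_to_go(rewards: torch.Tensor) -> torch.Tensor:
    # Reverse, cumulative-sum, reverse back: entry t becomes sum_{k=t}^{T} r_k.
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])


def policy_gradient_loss(trajectories, baseline: float) -> torch.Tensor:
    """Surrogate loss whose negative gradient matches the formula above.

    trajectories: list of (log_probs, rewards) pairs, one per trajectory tau in D,
                  with log_probs[t] = log pi_theta(a_t | s_t) and rewards[t] = r_t.
    baseline:     the scalar baseline b, treated as a constant.
    """
    losses = []
    for log_probs, rewards in trajectories:
        weight = reward_to_go(rewards) - baseline  # reward-to-go minus baseline
        # Detach the weight so gradients flow only through log pi_theta.
        losses.append(-(log_probs * weight.detach()).sum())
    # Average over the dataset D, giving the 1/|D| factor.
    return torch.stack(losses).mean()
```

Minimizing this loss ascends the objective J, since its gradient is the negative of the estimator above averaged over the sampled trajectories.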