Formula

Derivation of Reward Decomposition in Policy Gradient with Baseline

The policy gradient with a baseline can be manipulated algebraically to separate past and future rewards, a key step toward applying the causality principle for variance reduction. The derivation begins with the standard policy gradient formula with a baseline. Because neither the rewards nor the baseline depends on $\theta$, the total reward term can be distributed into the sum over timesteps and moved inside the derivative; this total reward sum is then decomposed into rewards accumulated before the current timestep $t$ and rewards from timestep $t$ onward. The derivation proceeds as follows:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \left( \frac{\partial}{\partial \theta} \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r_t - b \right)$$

$$= \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=1}^{T} r_k - b \right) \right]$$

$$= \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right) \right]$$

This final expression makes the distinction between past and future rewards explicit, setting the stage for eliminating the irrelevant past rewards from the gradient calculation.
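
To make the algebra concrete, here is a minimal NumPy sketch. All names (`grad_log_pi`, `rewards`, `baseline`) are hypothetical placeholders, and the dataset $\mathcal{D}$ is reduced to a single trajectory for brevity; the sketch checks numerically that the full-return form (first line above) and the past/future decomposition (last line) give the same gradient before any term is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

T, P = 5, 3                        # timesteps, number of policy parameters
# Hypothetical per-step quantities for one sampled trajectory:
# grad_log_pi[t] stands in for d/dtheta log pi_theta(a_t | s_t).
grad_log_pi = rng.normal(size=(T, P))
rewards = rng.normal(size=T)       # r_1, ..., r_T
baseline = rewards.mean()          # a simple constant baseline b

# Full-return form: every timestep weighted by the total return minus b.
full_return = rewards.sum() - baseline
grad_full = (grad_log_pi * full_return).sum(axis=0)

# Decomposed form: the same weight, split into past and future rewards.
grad_decomposed = np.zeros(P)
for t in range(T):
    past = rewards[:t].sum()       # sum_{k=1}^{t-1} r_k
    future = rewards[t:].sum()     # sum_{k=t}^{T} r_k (the reward-to-go)
    grad_decomposed += grad_log_pi[t] * (past + future - baseline)

# The decomposition is purely algebraic, so both forms must agree.
assert np.allclose(grad_full, grad_decomposed)
```

The causality step the text alludes to then discards the `past` term, weighting each log-probability gradient only by its reward-to-go $\sum_{k=t}^{T} r_k - b$, which leaves the estimator's expectation unchanged while reducing its variance.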
