Formula

Decomposition of Reward Sum for Causality in Policy Gradients

To distinguish between rewards accrued before and after an action at time step $t$, the total reward inside the gradient calculation can be decomposed. The sum is brought inside the gradient operator and split into past and future components:

$$\dots = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=1}^{T} r_k - b \right) \right]$$

$$= \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right) \right]$$

This decomposition explicitly separates the past-rewards term $\sum_{k=1}^{t-1} r_k$. Because the action $a_t$ cannot influence rewards received before step $t$, this term contributes no learning signal and can be safely omitted, further reducing gradient variance; each log-probability is then weighted only by the reward-to-go $\sum_{k=t}^{T} r_k - b$.
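As a concrete illustration, here is a minimal NumPy sketch (not from the source; the function name `return_weights`, the baseline value, and the example rewards are hypothetical) that computes, for each time step of one trajectory, both the full-return weight $\sum_{k=1}^{T} r_k - b$ and the reward-to-go weight $\sum_{k=t}^{T} r_k - b$ obtained after dropping the past-rewards term:

```python
import numpy as np

def return_weights(rewards, baseline=0.0):
    """Split each step's return weight into past and future parts.

    For every t: past[t] = sum_{k<t} r_k and to_go[t] = sum_{k>=t} r_k,
    so past[t] + to_go[t] equals the trajectory's total return.
    """
    rewards = np.asarray(rewards, dtype=float)
    total = rewards.sum()
    # past[t] = sum of rewards strictly before step t (0 at t = 0)
    past = np.concatenate(([0.0], np.cumsum(rewards)[:-1]))
    to_go = total - past  # reward-to-go: sum_{k=t}^{T} r_k

    # Full-return weight: identical (total - b) at every step
    full_weight = past + to_go - baseline
    # Causality: dropping the past term leaves the lower-variance weight
    reward_to_go_weight = to_go - baseline
    return full_weight, reward_to_go_weight

if __name__ == "__main__":
    full_w, rtg_w = return_weights([1.0, 0.0, 2.0, 1.0], baseline=0.5)
    print(full_w)  # [3.5 3.5 3.5 3.5] -- same total-return weight at every t
    print(rtg_w)   # [3.5 2.5 2.5 0.5] -- reward-to-go weights
```

Note that the full-return weight is identical at every step, while the reward-to-go weight shrinks over time; since the dropped past-rewards term is uncorrelated with the action at step $t$, omitting it lowers the variance of the gradient estimate without changing its expectation.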
