Formula

Derivation of the Policy Gradient from the Objective Function

The gradient of the policy performance objective, $J(\theta)$, with respect to the policy parameters $\theta$ is derived using the log-derivative trick. This technique rewrites the derivative as an expectation that can be estimated from sampled trajectories. The derivation is as follows:

$$
\frac{\partial J(\theta)}{\partial \theta}
= \frac{\partial}{\partial \theta} \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, R(\tau)
= \sum_{\tau \in \mathcal{D}} \frac{\partial \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
$$

Multiplying and dividing by $\mathrm{Pr}_{\theta}(\tau)$ introduces the gradient of the logarithm:

$$
= \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, \frac{1}{\mathrm{Pr}_{\theta}(\tau)}\, \frac{\partial \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
= \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, \frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
$$

This final form shows that the policy gradient is the expected value of the score function, $\frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}$, weighted by the cumulative reward $R(\tau)$. It can therefore be approximated by averaging $\frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)$ over trajectories $\tau$ sampled from the current policy.
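To make that Monte Carlo interpretation concrete, the sketch below (not from the course material; the one-step softmax policy, `reward_fn`, and all names are illustrative assumptions) estimates the gradient in the simplest setting, where a "trajectory" is a single action and $\log \mathrm{Pr}_{\theta}(\tau)$ reduces to $\log \pi_{\theta}(a)$ with a closed-form score function.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def estimate_policy_gradient(theta, reward_fn, num_samples=10_000, rng=None):
    """Monte Carlo estimate of dJ/dtheta = E[ R(tau) * d log Pr_theta(tau) / dtheta ].

    A trajectory here is one action drawn from a softmax policy over
    len(theta) actions, so the score function is one_hot(a) - softmax(theta).
    """
    rng = rng or np.random.default_rng(0)
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        a = rng.choice(len(theta), p=probs)   # sample tau ~ Pr_theta
        score = -probs.copy()                 # d log pi_theta(a) / dtheta
        score[a] += 1.0
        grad += reward_fn(a) * score          # weight the score by R(tau)
    return grad / num_samples

# Rewards favour action 2, so the estimate should push theta[2] up
# relative to the other components.
rewards = np.array([0.0, 1.0, 3.0])
theta = np.zeros(3)
print(estimate_policy_gradient(theta, lambda a: rewards[a]))
```

For $\theta = 0$ and rewards $[0, 1, 3]$, the exact expectation is $[-4/9,\, -1/9,\, 5/9]$, so the printed estimate should be close to those values, with the sampling error shrinking as `num_samples` grows.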
