Formula

Policy Gradient Estimate under Uniform Trajectory Probability

The general form of the policy gradient is an expectation over trajectories sampled from the policy. Treating every trajectory $\tau$ in a sampled dataset $\mathcal{D}$ (of size $m = |\mathcal{D}|$) as equally probable gives the practical estimator:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{\tau \in \mathcal{D}} \frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta} R(\tau)$$

The trajectory probability factors into the initial-state distribution, the policy, and the environment dynamics, $\mathrm{Pr}_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$. The dynamics terms do not depend on $\theta$, so their gradient vanishes. With the cumulative reward $R(\tau) = \sum_{t=1}^{T} r_t$, the estimator expands to:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \Big( \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \Big) \sum_{t=1}^{T} r_t$$

This formulation highlights that the reward function $R(\tau)$ does not need to be differentiable for optimization: it enters only as a scalar weight on the gradient of the log-probabilities.
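With automatic differentiation, the estimator amounts to backpropagating through the surrogate $\frac{1}{m} \sum_{\tau} \big( \sum_t \log \pi_{\theta}(a_t|s_t) \big) R(\tau)$, in which $R(\tau)$ is a detached scalar. Below is a minimal PyTorch sketch of this computation; the `PolicyNet` class, the trajectory format, and the toy data are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn

# Illustrative categorical policy over one-hot states (assumption, not from the source).
class PolicyNet(nn.Module):
    def __init__(self, n_states: int, n_actions: int):
        super().__init__()
        self.logits = nn.Linear(n_states, n_actions)

    def forward(self, state_onehot: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.logits(state_onehot))

def policy_gradient_estimate(policy: PolicyNet, trajectories: list) -> None:
    """Accumulate dJ/dtheta = (1/m) * sum_tau [d/dtheta sum_t log pi(a_t|s_t)] * R(tau)
    into the policy parameters' .grad fields.

    Each trajectory is a list of (state_onehot, action, reward) tuples.
    """
    m = len(trajectories)
    surrogate = 0.0
    for tau in trajectories:
        log_probs = torch.stack(
            [policy(s).log_prob(torch.tensor(a)) for s, a, _ in tau]
        )
        # R(tau) is a plain float: no gradient flows through the reward,
        # so R(tau) need not be differentiable.
        R = sum(r for _, _, r in tau)
        surrogate = surrogate + log_probs.sum() * R
    # backward() on the averaged surrogate leaves the gradient estimate in .grad;
    # a gradient-ascent step would then add a multiple of it to the parameters.
    (surrogate / m).backward()

# Usage: two toy trajectories over 3 one-hot states and 2 actions.
policy = PolicyNet(n_states=3, n_actions=2)
s = torch.eye(3)
trajectories = [
    [(s[0], 0, 1.0), (s[1], 1, 0.5)],
    [(s[2], 1, -0.2)],
]
policy_gradient_estimate(policy, trajectories)
print(policy.logits.weight.grad)
```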
