Formula

Policy Gradient Estimate under Uniform Trajectory Probability

The general form of the policy gradient is an expectation over trajectories sampled from the policy. Treating every trajectory $\tau$ in a sampled dataset $\mathcal{D}$ (of size $m = |\mathcal{D}|$) as equally probable gives the practical estimator:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{\tau \in \mathcal{D}} \frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta} R(\tau)$$

The trajectory probability factors into the initial-state distribution, the policy, and the environment dynamics, $\mathrm{Pr}_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$. The dynamics terms do not depend on $\theta$, so their gradient vanishes. With the cumulative reward $R(\tau) = \sum_{t=1}^{T} r_t$, the estimator expands to:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \Big( \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) \Big) \sum_{t=1}^{T} r_t$$

This formulation highlights that the reward function $R(\tau)$ does not need to be differentiable for optimization: it enters only as a scalar weight on the gradient of the log-probabilities.
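With automatic differentiation, the estimator amounts to backpropagating through the surrogate $\frac{1}{m} \sum_{\tau} \big( \sum_t \log \pi_{\theta}(a_t|s_t) \big) R(\tau)$, in which $R(\tau)$ is a detached scalar. Below is a minimal PyTorch sketch of this computation; the `PolicyNet` class, the trajectory format, and the toy data are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn

# Illustrative categorical policy over one-hot states (assumption, not from the source).
class PolicyNet(nn.Module):
    def __init__(self, n_states: int, n_actions: int):
        super().__init__()
        self.logits = nn.Linear(n_states, n_actions)

    def forward(self, state_onehot: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.logits(state_onehot))

def policy_gradient_estimate(policy: PolicyNet, trajectories: list) -> None:
    """Accumulate dJ/dtheta = (1/m) * sum_tau [d/dtheta sum_t log pi(a_t|s_t)] * R(tau)
    into the policy parameters' .grad fields.

    Each trajectory is a list of (state_onehot, action, reward) tuples.
    """
    m = len(trajectories)
    surrogate = 0.0
    for tau in trajectories:
        log_probs = torch.stack(
            [policy(s).log_prob(torch.tensor(a)) for s, a, _ in tau]
        )
        # R(tau) is a plain float: no gradient flows through the reward,
        # so R(tau) need not be differentiable.
        R = sum(r for _, _, r in tau)
        surrogate = surrogate + log_probs.sum() * R
    # backward() on the averaged surrogate leaves the gradient estimate in .grad;
    # a gradient-ascent step would then add a multiple of it to the parameters.
    (surrogate / m).backward()

# Usage: two toy trajectories over 3 one-hot states and 2 actions.
policy = PolicyNet(n_states=3, n_actions=2)
s = torch.eye(3)
trajectories = [
    [(s[0], 0, 1.0), (s[1], 1, 0.5)],
    [(s[2], 1, -0.2)],
]
policy_gradient_estimate(policy, trajectories)
print(policy.logits.weight.grad)
```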
