Formula

Derivation of the Policy Gradient from the Objective Function

The gradient of the policy performance objective, $J(\theta)$, with respect to the policy parameters $\theta$ is derived using the log-derivative trick. This technique rewrites the derivative as an expectation that can be estimated from sampled trajectories. The derivation is as follows:

$$
\frac{\partial J(\theta)}{\partial \theta}
= \frac{\partial}{\partial \theta} \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, R(\tau)
= \sum_{\tau \in \mathcal{D}} \frac{\partial \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
$$

Multiplying and dividing by $\mathrm{Pr}_{\theta}(\tau)$ introduces the gradient of the logarithm:

$$
= \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, \frac{1}{\mathrm{Pr}_{\theta}(\tau)}\, \frac{\partial \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
= \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, \frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)
$$

This final form shows that the policy gradient is the expected value of the score function, $\frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}$, weighted by the cumulative reward $R(\tau)$. It can therefore be approximated by averaging $\frac{\partial \log \mathrm{Pr}_{\theta}(\tau)}{\partial \theta}\, R(\tau)$ over trajectories $\tau$ sampled from the current policy.
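To make that Monte Carlo interpretation concrete, the sketch below (not from the course material; the one-step softmax policy, `reward_fn`, and all names are illustrative assumptions) estimates the gradient in the simplest setting, where a "trajectory" is a single action and $\log \mathrm{Pr}_{\theta}(\tau)$ reduces to $\log \pi_{\theta}(a)$ with a closed-form score function.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def estimate_policy_gradient(theta, reward_fn, num_samples=10_000, rng=None):
    """Monte Carlo estimate of dJ/dtheta = E[ R(tau) * d log Pr_theta(tau) / dtheta ].

    A trajectory here is one action drawn from a softmax policy over
    len(theta) actions, so the score function is one_hot(a) - softmax(theta).
    """
    rng = rng or np.random.default_rng(0)
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        a = rng.choice(len(theta), p=probs)   # sample tau ~ Pr_theta
        score = -probs.copy()                 # d log pi_theta(a) / dtheta
        score[a] += 1.0
        grad += reward_fn(a) * score          # weight the score by R(tau)
    return grad / num_samples

# Rewards favour action 2, so the estimate should push theta[2] up
# relative to the other components.
rewards = np.array([0.0, 1.0, 3.0])
theta = np.zeros(3)
print(estimate_policy_gradient(theta, lambda a: rewards[a]))
```

For $\theta = 0$ and rewards $[0, 1, 3]$, the exact expectation is $[-4/9,\, -1/9,\, 5/9]$, so the printed estimate should be close to those values, with the sampling error shrinking as `num_samples` grows.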
