Policy Gradient with Advantage Function Formula

In actor-critic methods, this formula defines the gradient used to update the actor's policy parameters $\theta$. The gradient of the policy objective $J(\theta)$ is expressed through the advantage function $A(s_t, a_t)$, which is often supplied by the critic, and is estimated by averaging over a set of trajectories $\mathcal{D}$ sampled from the policy:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \, A(s_t, a_t) \right)$$

This update rule steers the actor's policy toward actions with a positive advantage and away from those with a negative advantage, thereby improving overall performance.
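As a minimal sketch of how this estimator is used in practice, the following PyTorch code builds the surrogate loss whose gradient matches the formula above and takes one ascent step on $J(\theta)$. The network, the toy dimensions, and the random placeholder batch are illustrative assumptions, not from the source; in a real loop the states, actions, and advantages would come from policy rollouts and the critic.

```python
# Minimal sketch (assumed names and shapes) of the policy-gradient update
# with an advantage function, using PyTorch.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny categorical policy pi_theta(a|s) over a discrete action space."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, states: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(states))

state_dim, num_actions, T, num_traj = 4, 2, 10, 8
policy = PolicyNet(state_dim, num_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder batch: |D| trajectories of length T. In practice the states and
# actions come from rollouts of pi_theta, and A(s_t, a_t) from the critic.
states = torch.randn(num_traj, T, state_dim)
actions = torch.randint(num_actions, (num_traj, T))
advantages = torch.randn(num_traj, T)  # stand-in for critic estimates

dist = policy(states)               # pi_theta(. | s_t) for every (tau, t)
log_probs = dist.log_prob(actions)  # log pi_theta(a_t | s_t)

# Negate the objective so gradient *descent* performs gradient *ascent* on
# J(theta): loss = -(1/|D|) sum_tau sum_t log pi(a_t|s_t) * A(s_t, a_t).
loss = -(log_probs * advantages).sum(dim=1).mean()

optimizer.zero_grad()
loss.backward()   # autograd supplies dJ/dtheta from the formula above
optimizer.step()  # moves theta toward positive-advantage actions
```

Note that the advantages enter the loss as constants: gradients flow only through $\log \pi_{\theta}(a_t \mid s_t)$, exactly as in the formula.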
