Policy Gradient with Advantage Function Formula
In actor-critic methods, this formula defines the gradient used to update the actor's policy parameters θ. The gradient of the policy objective function J(θ) is expressed using the advantage function A(s_t, a_t), which is often supplied by the critic. The gradient is estimated by averaging over a set of trajectories T sampled from the current policy:

∇_θ J(θ) ≈ (1/|T|) Σ_{τ ∈ T} Σ_t ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t)

This update rule steers the actor's policy towards actions with a positive advantage and away from those with a negative advantage, thereby improving overall performance.
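As a minimal sketch (not taken from the book), assuming a small PyTorch softmax policy and advantages already computed by a critic, the estimator above can be implemented as a loss whose gradient matches ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t) averaged over the sampled batch. The names PolicyNet, policy_gradient_loss, and the toy data are illustrative assumptions, not the book's code:

```python
# Minimal sketch of the advantage-weighted policy-gradient estimate:
# grad J(theta) ≈ mean over sampled steps of grad log pi_theta(a_t|s_t) * A(s_t, a_t).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny softmax policy pi_theta(a | s) over a discrete action space."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.linear = nn.Linear(state_dim, num_actions)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return torch.log_softmax(self.linear(states), dim=-1)  # log pi_theta(. | s)

def policy_gradient_loss(log_probs_all, actions, advantages):
    """Negative of the objective, so minimizing it performs gradient ascent on J(theta).

    log_probs_all: [batch, num_actions] log-probabilities from the policy
    actions:       [batch] indices of the actions actually taken
    advantages:    [batch] A(s_t, a_t), e.g. from a critic; treated as fixed weights
    """
    log_pi_a = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_pi_a * advantages.detach()).mean()

# Toy usage on random data: one gradient-ascent step on the sampled batch.
policy = PolicyNet(state_dim=4, num_actions=3)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

states = torch.randn(8, 4)           # s_t from sampled trajectories
actions = torch.randint(0, 3, (8,))  # a_t taken by the current policy
advantages = torch.randn(8)          # A(s_t, a_t) supplied by the critic

loss = policy_gradient_loss(policy(states), actions, advantages)
optimizer.zero_grad()
loss.backward()   # autograd supplies grad log pi_theta(a_t|s_t) * A(s_t, a_t)
optimizer.step()  # actions with positive advantage become more likely
```

Detaching the advantages keeps the critic's estimate out of the actor's gradient, which mirrors treating A(s_t, a_t) as a fixed weight in the formula above.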

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient with Advantage Function Formula
A2C Loss Function Formulation
In a reinforcement learning scenario, an agent is in a particular state. The estimated value of being in this state, averaged over all possible actions the agent could take, is +10. If the agent chooses a specific action, the estimated value of taking that particular action in that state is +8. Based on this information, what can be concluded about this specific action? (A worked computation of the advantage appears after this Related list.)
If an action has a positive advantage value, it means that taking this action is guaranteed to result in a higher immediate reward than any other action available in that state.
Interpreting Action Advantage
A2C Loss Function Formulation
An agent is being trained using a policy gradient method. The objective is to maximize an objective function J(θ), where π_θ is the policy and A(s, a) is the advantage function, which indicates how much better an action is than the average.
At a specific state s, the agent can choose from three actions. The calculated advantage values for these actions are:
Assuming the agent performs one optimization step to maximize the objective, how will the policy probabilities for these actions most likely change?
Impact of a Zero Advantage Value
Policy Gradient with Advantage Function Formula
Rationale for Using the Advantage Function in Policy Gradients
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
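For the Related question above about a state value of +10 and an action value of +8, a short worked computation may help. It assumes the standard definition of the advantage as the action value minus the state value, which the question preview does not state explicitly:

```latex
% Worked example (assumed definition): advantage = action value - state value.
A(s, a) = Q(s, a) - V(s) = 8 - 10 = -2
```

A negative advantage means the action is estimated to be worse than the average action available in that state; it implies nothing about guarantees on immediate reward relative to every other action.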
Learn After
An agent is learning a task using a policy update rule based on the advantage-weighted policy gradient defined above, where π_θ(a_t | s_t) is the policy and A(s_t, a_t) is the advantage of taking action a_t in state s_t. In a specific state s, the agent takes an action a that results in an advantage value A(s, a) = -3.0. Based on this single experience, how will the update rule adjust the policy π_θ? (A one-step numeric sketch of this update appears after this list.)
Diagnosing Policy Update Instability
A2C Actor Loss Function
Role of the Advantage Function in Policy Updates
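For the "Diagnosing Policy Update Instability" question above, where the sampled action has advantage A(s, a) = -3.0, the sketch below (plain NumPy, with all numbers and names as illustrative assumptions) applies one gradient-ascent step on log π_θ(a | s) · A(s, a) to a softmax policy and shows the probability of that action shrinking:

```python
# Illustrative sketch: effect of one gradient-ascent step on log pi_theta(a|s) * A(s, a)
# for a softmax policy when the taken action has a negative advantage, e.g. A(s, a) = -3.0.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.zeros(3)   # theta: one logit per action, uniform policy to start
action = 0             # the action a taken in state s
advantage = -3.0       # A(s, a), e.g. reported by the critic
lr = 0.1

probs = softmax(logits)
print("before:", probs)            # [0.333, 0.333, 0.333]

# d/dz_k [ log pi(a|s) ] = 1[k == a] - pi(k|s), so the ascent step is:
one_hot = np.eye(3)[action]
logits += lr * advantage * (one_hot - probs)

print("after: ", softmax(logits))  # probability of the taken action goes down
```

The sign of the advantage drives the behavior: the same step with A(s, a) > 0 would raise the taken action's logit and probability instead.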