1Cademy - Policy Gradient Objective with Advantage Function

Policy Gradient Objective with Advantage Function

In policy gradient methods, a common objective to maximize is formulated with the advantage function, $A(s_t, a_t)$. Weighting by the advantage rather than the raw return reduces the variance of the gradient estimate, which improves training stability. This objective, denoted $U(\tau; \theta)$, is the sum over a trajectory $\tau$ of the log-probabilities of the actions taken, each multiplied by its advantage value:

$$U(\tau; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t)\, A(s_t, a_t)$$

Here:

- $\pi_{\theta}(a_t|s_t)$ is the policy, which gives the probability of taking action $a_t$ in state $s_t$.
- $A(s_t, a_t)$ is the advantage function, which measures how much better action $a_t$ is than the expected value of state $s_t$ under the policy.

Maximizing this objective via gradient ascent raises the probability of actions with positive advantage (a higher-than-average expected return) and lowers the probability of actions with negative advantage.
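As a rough illustration of the formula above, the following sketch evaluates $U(\tau; \theta)$ and its gradient for a small tabular softmax policy in NumPy. The tabular parameterization (one logit per state-action pair), the function names, and the toy trajectory are assumptions made for this example and are not part of the original node. The advantages are treated as fixed numbers, so they do not contribute to the gradient.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def objective(theta, states, actions, advantages):
    """U(tau; theta) = sum_t log pi_theta(a_t | s_t) * A(s_t, a_t).

    theta: (n_states, n_actions) logits of a hypothetical tabular policy.
    states, actions: integer arrays of length T.
    advantages: float array of length T, assumed to be precomputed.
    """
    log_probs = log_softmax(theta[states])[np.arange(len(actions)), actions]
    return float(np.sum(log_probs * advantages))

def objective_grad(theta, states, actions, advantages):
    """Gradient of U with respect to theta, holding the advantages constant."""
    grad = np.zeros_like(theta)
    probs = np.exp(log_softmax(theta[states]))      # shape (T, n_actions)
    one_hot = np.eye(theta.shape[1])[actions]       # shape (T, n_actions)
    # For a softmax policy: d log pi(a|s) / d logits(s, .) = one_hot(a) - pi(.|s)
    np.add.at(grad, states, (one_hot - probs) * advantages[:, None])
    return grad

# Toy trajectory with made-up values: 2 states, 3 actions, T = 3 steps.
theta = np.zeros((2, 3))                 # uniform policy to start
states = np.array([0, 1, 0])
actions = np.array([2, 0, 1])
advantages = np.array([1.0, -0.5, 2.0])

u_before = objective(theta, states, actions, advantages)
grad = objective_grad(theta, states, actions, advantages)
theta_new = theta + 0.1 * grad           # one gradient ascent step
u_after = objective(theta_new, states, actions, advantages)
```

In this toy setup, a single small ascent step increases the objective: the logits of actions with positive advantage go up, and the logit of the action with negative advantage goes down. In practice the gradient would usually come from an automatic differentiation library rather than a hand-written formula.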


Updated 2026-05-01
