Formula

Policy Gradient Objective with Advantage Function

In policy gradient methods, a common objective function to maximize is formulated using the advantage function, $A(s_t, a_t)$, to improve training stability. This objective, denoted $U(\tau; \theta)$, is the sum over a trajectory of the log-probabilities of actions multiplied by their corresponding advantage values:

$$
U(\tau; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)
$$

Here:

- $\pi_{\theta}(a_t \mid s_t)$ is the policy, which gives the probability of taking action $a_t$ in state $s_t$.
- $A(s_t, a_t)$ is the advantage function, which measures how much better action $a_t$ is compared to the expected value of state $s_t$.

Maximizing this objective via gradient ascent encourages the policy to take actions that have a higher-than-average expected return.
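As a concrete illustration, here is a minimal PyTorch sketch of this objective. Everything in it is an assumption for demonstration purposes: the function name `policy_gradient_objective`, the toy logits, actions, and advantage values are not from the original text, and the state-independent categorical policy is a deliberate simplification to keep the example self-contained.

```python
import torch

def policy_gradient_objective(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # U(tau; theta) = sum_t log pi_theta(a_t | s_t) * A(s_t, a_t).
    # Advantages are detached: gradients flow only through the log-probabilities.
    return (log_probs * advantages.detach()).sum()

# Hypothetical toy trajectory with T = 4 steps and 3 possible actions.
logits = torch.zeros(3, requires_grad=True)       # policy parameters theta
actions = torch.tensor([0, 2, 1, 0])              # actions a_t from one trajectory
advantages = torch.tensor([1.5, -0.3, 0.8, 0.2])  # advantage estimates A(s_t, a_t)

# A state-independent categorical policy, expanded to one distribution per step.
dist = torch.distributions.Categorical(logits=logits.expand(len(actions), -1))
log_probs = dist.log_prob(actions)                # log pi_theta(a_t | s_t), shape (T,)

objective = policy_gradient_objective(log_probs, advantages)
objective.backward()                              # logits.grad now holds the ascent direction
print(objective.item())
print(logits.grad)
```

Detaching the advantages mirrors the formula: the advantage estimates act as fixed per-step weights, and the gradient of the objective flows only through $\log \pi_{\theta}$.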
