Essay

Rationale for Using the Advantage Function in Policy Gradients

A standard policy gradient objective can be formulated using the total return of a trajectory. An alternative formulation, shown below, replaces the total return with an 'advantage' term, A(s_t, a_t) = Q(s_t, a_t) - V(s_t), which measures how much better a specific action is than the average action the policy would take in that state.

U(\tau; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t|s_t) A(s_t, a_t)
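To make the objective concrete, here is a minimal sketch in Python; the function name and the assumption that log-probabilities and advantage estimates arrive as precomputed arrays are illustrative, not part of the original question:

```python
import numpy as np

def surrogate_objective(log_probs, advantages):
    """Advantage-weighted surrogate U(tau; theta) for one trajectory.

    log_probs:  log pi_theta(a_t | s_t) at each timestep t = 1..T
    advantages: estimates of A(s_t, a_t) at the same timesteps
    """
    # The advantages act as fixed weights: actions that did better than
    # average (positive advantage) have their log-probability pushed up,
    # while worse-than-average actions are pushed down.
    return np.sum(log_probs * advantages)
```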

Analyze why using the advantage term, A(s_t, a_t), in the objective function is often preferred over using the raw total return. In your analysis, discuss how this change affects the variance of the gradient estimates and the overall stability of the learning process.
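One way to see the variance effect is empirically. The toy construction below (not from the source; the bandit setup, reward means, and variable names are all assumptions for illustration) compares Monte Carlo policy-gradient estimates weighted by raw returns against estimates weighted by return minus a state-value baseline. Because E_{a~pi}[grad log pi(a|s) b(s)] = 0 for any action-independent baseline b(s), both estimators are unbiased, but their variances differ sharply:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0                          # logit of action 1 in a two-action softmax policy
p1 = 1.0 / (1.0 + np.exp(-theta))    # pi(a=1 | s)
means = np.array([10.0, 11.0])       # large mean returns; only their gap matters

def grad_estimates(use_baseline, n=100_000):
    """Per-sample policy-gradient estimates, with or without a baseline."""
    a = (rng.random(n) < p1).astype(int)          # sample actions from pi
    returns = means[a] + rng.normal(0.0, 1.0, n)  # noisy observed returns
    grad_logp = np.where(a == 1, 1.0 - p1, -p1)   # d/dtheta of log pi(a|s)
    # Subtracting the state value V(s) turns the weight into an advantage.
    baseline = (p1 * means[1] + (1 - p1) * means[0]) if use_baseline else 0.0
    return grad_logp * (returns - baseline)

for flag in (False, True):
    g = grad_estimates(flag)
    print(f"baseline={flag}: mean grad = {g.mean():.3f}, variance = {g.var():.3f}")
```

Both settings recover the same mean gradient, but subtracting the baseline removes the large shared component of the returns, shrinking the variance by roughly two orders of magnitude in this setup. Only the relative merit of each action then drives the update, which is why advantage-weighted gradients yield more stable learning.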

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
