Policy Gradient Reformulation using Advantage Function

The policy gradient is the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$. It can be reformulated in terms of the advantage function $A(s_t, a_t)$, which measures how much better taking action $a_t$ in state $s_t$ is than the policy's average behavior in that state. Policy gradient algorithms commonly make this substitution because it reduces the high variance of gradient estimates without biasing them, which leads to more stable and efficient learning.
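The following is a minimal sketch of the reformulated estimator. It assumes the standard form $\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$ with $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. The toy setting is chosen for illustration only and is not taken from the course. It uses a single state, a softmax policy over three actions with logits `theta`, rewards equal to $Q(s, a)$ plus unit Gaussian noise, and a value $V(s)$ computed exactly rather than learned by a critic. Each sampled reward $r$ then gives the advantage estimate $r - V(s)$. All function names are hypothetical. The script compares the plain reward-weighted estimator with the advantage-weighted one.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax: converts logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(probs, action):
    """Gradient of log softmax(theta)[action] w.r.t. theta: one_hot(action) - probs."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def sample_action(probs, rng):
    """Draw an action index from the categorical distribution `probs`."""
    u = rng.random()
    cumulative = 0.0
    for a, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return a
    return len(probs) - 1  # guard against floating-point round-off


def exact_gradient(theta, q_values):
    """Closed-form gradient for this toy problem: pi_k * (Q_k - V)."""
    probs = softmax(theta)
    value = sum(p * q for p, q in zip(probs, q_values))
    return [p * (q - value) for p, q in zip(probs, q_values)]


def estimate_gradient(theta, q_values, n_samples, use_advantage, seed=0):
    """Monte Carlo policy-gradient estimate; returns per-component mean and variance.

    Hypothetical toy setup: one state, Q(s, a) = q_values[a], rewards are
    Q(s, a) plus unit Gaussian noise, and V(s) is computed exactly.
    """
    rng = random.Random(seed)
    probs = softmax(theta)
    value = sum(p * q for p, q in zip(probs, q_values))  # V(s) = E_pi[Q(s, a)]
    samples = []
    for _ in range(n_samples):
        a = sample_action(probs, rng)
        reward = q_values[a] + rng.gauss(0.0, 1.0)
        # Advantage estimate r - V(s), or the raw reward if no baseline is used.
        weight = reward - value if use_advantage else reward
        samples.append([weight * g for g in grad_log_pi(probs, a)])
    k = len(theta)
    mean = [sum(s[i] for s in samples) / n_samples for i in range(k)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n_samples for i in range(k)]
    return mean, var


if __name__ == "__main__":
    theta = [0.0, 0.5, -0.5]
    q_values = [10.0, 11.0, 9.0]  # large shared offset inflates raw-reward variance
    print("exact:     ", exact_gradient(theta, q_values))
    for use_adv in (False, True):
        mean, var = estimate_gradient(theta, q_values, 20000, use_adv, seed=1)
        label = "advantage" if use_adv else "reward"
        print(f"{label:10s} mean={mean} total_var={sum(var):.3f}")
```

Both estimators average to the same gradient. Subtracting $V(s)$ leaves the expectation unchanged because $V(s)$ does not depend on the sampled action and $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$. Only the spread of the per-sample estimates shrinks. The effect is especially visible here because all rewards share a large offset of about 10.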

Updated 2026-05-17

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences