Concept

Policy Gradient Reformulation using Advantage Function

The policy gradient, i.e. the gradient of the objective function $J(\theta)$, can be reformulated so that the advantage function $A(s_t, a_t)$ replaces the raw return as the weight on each log-probability term:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right]$$

This substitution is a common technique in policy gradient algorithms because subtracting a baseline (the state value $V(s_t)$, giving $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$) leaves the gradient unbiased while reducing the high variance often associated with gradient estimates, leading to more stable and efficient learning.
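To illustrate why weighting by an advantage helps, the following sketch compares Monte Carlo policy-gradient estimates weighted by the raw return versus by the return minus a baseline. The setup (a single-state, two-action softmax policy; the names `theta`, `V`, `gradient_stats`; the specific reward values) is an illustrative assumption, not from the text: both estimators share the same expectation, but the advantage-weighted one has lower variance.

```python
import math
import random

# Hypothetical setup: one state, two actions, policy pi_theta(a) = softmax(theta).
# We compare gradient estimates weighted by the raw return R versus by the
# advantage-style weight R - V, where V is a baseline near the expected return.

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, action):
    # d/d theta_k of log pi(action) = 1[k == action] - pi(k)
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - probs[k]
            for k in range(len(theta))]

def gradient_stats(theta, baseline, n=20000, seed=0):
    """Mean and variance of the first gradient component over n samples."""
    rng = random.Random(seed)
    mean_rewards = [1.0, 1.2]          # assumed true mean return per action
    probs = softmax(theta)
    samples = []
    for _ in range(n):
        a = 0 if rng.random() < probs[0] else 1
        ret = mean_rewards[a] + rng.gauss(0.0, 0.5)  # noisy sampled return
        weight = ret - baseline        # advantage when baseline approximates V
        samples.append(weight * grad_log_pi(theta, a)[0])
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

theta = [0.0, 0.0]
V = 1.1  # baseline chosen near the expected return of the uniform policy
mean_raw, var_raw = gradient_stats(theta, baseline=0.0)
mean_adv, var_adv = gradient_stats(theta, baseline=V)

print(abs(mean_raw - mean_adv) < 0.05)  # same gradient in expectation
print(var_adv < var_raw)                # but much lower variance
```

Because the baseline does not depend on the sampled action, it shifts every weight by a constant whose contribution cancels in expectation, which is why the estimate stays unbiased while its variance shrinks.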

Updated 2025-10-07

Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences