Relation

Advantage Function as a Form of Shaped Reward

The value-based shaped reward, defined as $r' = r + \gamma V(s_{t+1}) - V(s_t)$, is mathematically identical to the Temporal Difference (TD) error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which is a common one-step estimator of the advantage function. This equivalence establishes a direct relationship between advantage-based methods, such as PPO, and potential-based reward shaping with the value function $V$ playing the role of the potential: the advantage estimate can be interpreted as a specific instance of a shaped reward.
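The equivalence can be checked numerically. A minimal sketch (the random trajectory, array names, and $\gamma = 0.99$ are illustrative assumptions, not part of the source): compute the one-step TD error and, separately, the potential-based shaped reward with potential $\Phi = V$, and confirm they coincide elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5                                  # illustrative trajectory length
rewards = rng.normal(size=T)           # r_t for t = 0..T-1
values = rng.normal(size=T + 1)        # V(s_t) for t = 0..T (bootstrap value included)
gamma = 0.99

# One-step TD error, a common advantage estimate:
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
td_error = rewards + gamma * values[1:] - values[:-1]

# Potential-based shaped reward with potential Phi = V:
#   r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t)
phi = values
shaped_reward = rewards + gamma * phi[1:] - phi[:-1]

# The two quantities are identical term by term.
print(np.allclose(td_error, shaped_reward))
```

Because both expressions share the same three terms, the equality holds for any trajectory and any value estimate, not just this random one.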

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models
