Concept

Impact of Reward Scale Variation on Policy Gradient Variance

A significant source of the high variance in policy gradient methods is that rewards can fluctuate drastically across steps. For example, if a reward model gives small positive rewards for good actions (such as R_t = 2) but imposes massive penalties for poor ones (such as R_t = -50), the total reward for a sequence can be very low even when it contains many good actions. This disparity obscures the contribution of each individual good action.
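The scale problem can be sketched numerically. A common mitigation (not described in the text itself, so treat this as an illustrative aside) is to normalize rewards to zero mean and unit variance before using them to scale gradients; the helper functions below are hypothetical names, and the reward values reuse the example from the paragraph above.

```python
import math

# Illustrative trajectory (values from the example above): mostly good
# actions (R_t = 2) with one heavily penalized action (R_t = -50).
rewards = [2.0, 2.0, -50.0, 2.0, 2.0]

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def normalize(xs, eps=1e-8):
    """Shift to zero mean and rescale to unit variance -- a common
    variance-reduction trick in policy gradient implementations."""
    m = sum(xs) / len(xs)
    s = math.sqrt(variance(xs))
    return [(x - m) / (s + eps) for x in xs]

raw_var = variance(rewards)               # large: the -50 outlier dominates
norm_var = variance(normalize(rewards))   # ~1.0 by construction
print(raw_var, norm_var)
```

Before normalization, the single -50 penalty dominates the variance of the per-step scaling factors; after normalization, the good actions (now ~0.5) remain clearly distinguishable from the penalized one (~-2.0) without the raw scale blowing up the gradient estimate.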

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences