1Cademy - In policy gradient methods, subtracting a baseline from the total reward is a technique used to reduce gradient variance. A key property of a properly chosen baseline is that it alters the expected value of the policy gradient, making the updates more conservative.

Learn Before

Baseline's Role in Centering Rewards and Reducing Gradient Variance

True/False

In policy gradient methods, subtracting a baseline from the total reward is a technique used to reduce gradient variance. A key property of a properly chosen baseline is that it alters the expected value of the policy gradient, making the updates more conservative.
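One way to test this statement is to compute the estimator's mean and variance exactly for a tiny problem. The sketch below is illustrative only: the three-action softmax policy, the reward values, and the helper names `softmax` and `gradient_stats` are assumptions, not part of the course material. It relies on the standard result that a baseline which does not depend on the sampled action leaves the expected policy gradient unchanged and affects only its variance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_stats(logits, rewards, baseline):
    """Exact mean and total variance of the single-sample estimator
    g = grad_theta log pi(a) * (R(a) - baseline), enumerating all actions."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a, p_a in enumerate(probs):
        for k in range(n):
            # Score function of a softmax policy: d log pi(a) / d theta_k
            score = (1.0 if a == k else 0.0) - probs[k]
            g = score * (rewards[a] - baseline)
            mean[k] += p_a * g
            second_moment[k] += p_a * g * g
    variance = sum(m2 - m * m for m2, m in zip(second_moment, mean))
    return mean, variance


# Hypothetical toy setup: rewards are all large and positive, so they are far from centered.
logits = [0.2, -0.1, 0.5]
rewards = [10.0, 11.0, 12.0]
probs = softmax(logits)
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_no_baseline, var_no_baseline = gradient_stats(logits, rewards, 0.0)
mean_with_baseline, var_with_baseline = gradient_stats(logits, rewards, expected_reward)

print("mean without baseline:", mean_no_baseline)
print("mean with baseline:   ", mean_with_baseline)
print("variance without / with baseline:", var_no_baseline, var_with_baseline)
```

Using the expected reward as the baseline is one common choice, not necessarily the variance-optimal one. Comparing the two means and the two variances shows which part of the statement holds.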

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences