Short Answer

Analysis of the Baseline's Effect on Policy Gradient Expectation

In policy gradient methods, a baseline term, typically a function of the state only, is subtracted from the return (the accumulated reward) to reduce the variance of the gradient estimate. Explain mathematically why the expected gradient contribution of such a baseline is zero, so that subtracting it introduces no bias into the overall policy gradient estimate. Your explanation should focus on the properties of the score function (the gradient of the log-policy).
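One way to approach the question is sketched below. The notation is an assumption, since the prompt fixes none: \pi_\theta(a \mid s) denotes the policy, b(s) a baseline that depends on the state only, and the action space is taken to be discrete. For continuous actions the sum would become an integral, provided the gradient and the integral can be exchanged.

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \bigl[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \bigr]
  &= b(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
     && \text{($b(s)$ does not depend on $a$)} \\
  &= b(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)
     && \text{(since $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$)} \\
  &= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta 1 \\
  &= 0 .
\end{align*}
```

The key property is that the score function has zero mean under the policy, because the action probabilities sum to one for every value of \theta. Since the inner expectation vanishes for each state, it also vanishes after averaging over the state distribution, so the baseline term can change the variance of the estimate but not its expectation. The argument relies on b(s) being independent of the action. A baseline that depends on the action could not be pulled out of the sum in the first step.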

Updated 2025-09-28

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science