Multiple Choice

In the mathematical proof that a state-dependent baseline b(s_t) does not introduce bias into the policy gradient estimate, the expected value of the baseline-related term, E[ ∇_θ log π_θ(a_t|s_t) * b(s_t) ], evaluates to zero. Which of the following is the fundamental reason for this outcome?
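
The claim in the question can be checked numerically. The following is a minimal sketch, not part of the original question, and it rests on assumptions: a single fixed state, a softmax policy whose parameters θ are the action logits themselves, and an arbitrary constant standing in for b(s_t). The names `softmax`, `grad_log_pi`, and `expected_baseline_term` are hypothetical helpers introduced only for illustration. The sketch computes the exact expectation over actions of ∇_θ log π_θ(a|s) * b(s) and checks that it vanishes.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def grad_log_pi(logits, action):
    # Gradient of log softmax(logits)[action] with respect to the logits:
    # one_hot(action) - pi. (Assumes theta is the logit vector itself.)
    pi = softmax(logits)
    g = -pi
    g[action] += 1.0
    return g


def expected_baseline_term(logits, baseline):
    # Exact expectation over actions a ~ pi(.|s) of grad log pi(a|s) * b(s).
    # The baseline depends only on the state, so it is a constant here.
    pi = softmax(logits)
    total = np.zeros_like(logits)
    for a in range(len(logits)):
        total += pi[a] * grad_log_pi(logits, a) * baseline
    return total


logits = np.array([0.5, -1.2, 2.0, 0.1])  # arbitrary illustrative logits
b_s = 3.7                                  # arbitrary baseline value for this state
term = expected_baseline_term(logits, b_s)
print(np.allclose(term, 0.0))
```

Changing `b_s` to any other constant leaves the result at zero (up to floating-point error) under these assumptions, because the baseline factors out of the sum over actions.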

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science