1Cademy - In a policy gradient algorithm, the update for the policy parameters is influenced by the term `(R - b)`, where `R` is the total reward for an episode and `b` is a baseline. Imagine you are training an agent where most episodes yield a small, positive total reward (e.g., between 1 and 5). If you set the baseline `b` to a constant, large positive value (e.g., 10), what is the most likely consequence for the learning process?

Learn Before

Baseline's Role in Centering Rewards and Reducing Gradient Variance

Multiple Choice

In a policy gradient algorithm, the update for the policy parameters is influenced by the term `(R - b)`, where `R` is the total reward for an episode and `b` is a baseline. Imagine you are training an agent where most episodes yield a small, positive total reward (e.g., between 1 and 5). If you set the baseline `b` to a constant, large positive value (e.g., 10), what is the most likely consequence for the learning process?
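To reason about the scenario, it can help to look at the sign and size of the `(R - b)` term directly. The sketch below is only an illustration under stated assumptions: the episode returns are drawn uniformly from the range 1 to 5 (the question gives the range, not the distribution), and the names `returns` and `weights` are hypothetical. It compares the constant baseline of 10 with a baseline set to the average return.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumption: episode returns are uniform on [1, 5]; the question only gives the range.
returns = [random.uniform(1.0, 5.0) for _ in range(1000)]


def weights(returns, baseline):
    """Return the (R - b) term for each episode."""
    return [r - baseline for r in returns]


large_b = weights(returns, 10.0)                     # constant baseline from the question
mean_b = weights(returns, statistics.mean(returns))  # baseline near the average return

# Fraction of episodes whose (R - b) term is negative under each baseline.
frac_negative_large = sum(w < 0 for w in large_b) / len(large_b)
frac_negative_mean = sum(w < 0 for w in mean_b) / len(mean_b)

print("b = 10:   fraction of negative weights =", frac_negative_large)
print("b = mean: fraction of negative weights =", frac_negative_mean)
```

Because every return in this range is below 10, every `(R - b)` value under the large baseline is negative, whereas a baseline near the average return splits episodes into above-average (positive) and below-average (negative) cases.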

Updated 2025-09-26
