Multiple Choice

In a policy gradient algorithm, the update for the policy parameters is influenced by the term (R - b), where R is the total reward for an episode and b is a baseline. Imagine you are training an agent where most episodes yield a small, positive total reward (e.g., between 1 and 5). If you set the baseline b to a constant, large positive value (e.g., 10), what is the most likely consequence for the learning process?
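
To make the scenario concrete, the following is a minimal sketch of how the (R - b) weight behaves under the stated setup. It assumes a REINFORCE-style update in which each episode's gradient of the log-probability is scaled by (R - b). The uniformly sampled returns, the `advantages` helper, and the mean-return comparison baseline are illustrative assumptions, not part of the original question.

```python
import random


def advantages(returns, baseline):
    """Return the (R - b) weight for each episode's total reward R."""
    return [r - baseline for r in returns]


random.seed(0)

# Hypothetical episode returns in the range described in the question (1 to 5).
returns = [random.uniform(1.0, 5.0) for _ in range(1000)]

# Case 1: the constant, large baseline from the question (b = 10).
weights_large_b = advantages(returns, baseline=10.0)

# Case 2 (assumed comparison): a baseline equal to the average return.
mean_return = sum(returns) / len(returns)
weights_mean_b = advantages(returns, baseline=mean_return)

# Fraction of episodes whose update weight is negative in each case.
frac_negative_large = sum(w < 0 for w in weights_large_b) / len(returns)
frac_negative_mean = sum(w < 0 for w in weights_mean_b) / len(returns)

print("b = 10:      fraction of negative weights =", frac_negative_large)
print("b = mean(R): fraction of negative weights =", frac_negative_mean)
```

With b = 10 every weight lies between -9 and -5, so under this sketch each update lowers the log-probability of the actions that were sampled, with better episodes penalized only slightly less than worse ones. With a baseline near the average return, the weights take both signs.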

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science