1Cademy - A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episodes total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?

Learn Before

Baseline Method for Policy Gradient Variance Reduction

Multiple Choice

A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episode's total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?

Updated 2025-10-06

Contributors are:

Who are from:

Learn Before

Related