Baseline's Role in Centering Rewards and Reducing Gradient Variance
The subtraction of a baseline, b, from the total reward, R, serves to center the reward values. For instance, if the baseline is defined as the expected total reward, b = E[R], this operation centers the rewards around zero. This centering is the direct mechanism for variance reduction: it shrinks the magnitude of the product term (R - b) ∇ log π(a|s), which is used to estimate the policy gradient, so the estimate fluctuates far less from episode to episode.
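To make this concrete, here is a minimal sketch, not taken from the source: a toy two-action bandit with a softmax policy, all-positive rewards near 100, and a baseline of 100 (roughly the expected reward). Every name and number in it is an illustrative assumption; it only demonstrates that centering leaves the expected gradient unchanged while cutting its variance.

```python
# Minimal sketch (illustrative assumptions throughout): compare the variance
# of a REINFORCE-style gradient estimate with and without a baseline.
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3  # single logit parameterizing a softmax policy over two actions

def action_probs(theta):
    logits = np.array([theta, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(action, theta):
    # d/dtheta log pi(action): 1 - pi(0) for action 0, -pi(0) for action 1.
    p = action_probs(theta)
    return (1.0 if action == 0 else 0.0) - p[0]

# Large, all-positive rewards, so centering matters.
mean_rewards = np.array([101.0, 99.0])

def sample_grad(baseline, n=10_000):
    p = action_probs(theta)
    actions = rng.choice(2, size=n, p=p)
    rewards = mean_rewards[actions] + rng.normal(0.0, 1.0, size=n)
    # The product term (R - b) * grad log pi, per sampled episode.
    g = (rewards - baseline) * np.array([grad_log_pi(a, theta) for a in actions])
    return g.mean(), g.var()

mean_no_b, var_no_b = sample_grad(baseline=0.0)
mean_b, var_b = sample_grad(baseline=100.0)  # roughly the expected reward

print(f"no baseline : mean={mean_no_b:+.4f} var={var_no_b:10.2f}")
print(f"b = 100     : mean={mean_b:+.4f} var={var_b:10.2f}")
```

With these settings, the two mean estimates agree up to sampling noise, while the centered estimator's variance is orders of magnitude smaller.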

Related
Policy Gradient Estimate with Baseline
State-Value Function as a Baseline
Baseline's Impact on Reward Variance vs. Gradient Estimate Variance
An engineer is training two reinforcement learning agents (Agent A and Agent B) on the same task using a policy gradient method. The environment has a wide range of possible total rewards, from highly negative to highly positive. Agent A's learning algorithm directly uses the total reward received after each episode to update its policy. Agent B's algorithm first subtracts a constant value (equal to the average total reward observed so far) from the total reward before using it for the update. What is the most likely difference in the training process between Agent A and Agent B?
Benefit of a Baseline in a Positive-Reward Environment
A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episode's total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?
Learn After
In a policy gradient algorithm, the update for the policy parameters is influenced by the term (R - b), where R is the total reward for an episode and b is a baseline. Imagine you are training an agent where most episodes yield a small, positive total reward (e.g., between 1 and 5). If you set the baseline b to a constant, large positive value (e.g., 10), what is the most likely consequence for the learning process?
Diagnosing Training Instability
In policy gradient methods, subtracting a baseline from the total reward is a technique used to reduce gradient variance. A key property of a properly chosen baseline is that it does not alter the expected value of the policy gradient: the estimate remains unbiased, while its variance, and hence the noisiness of the updates, is reduced.
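A standard one-line derivation, added here for completeness rather than taken from the source, shows why a constant baseline introduces no bias: the score function has zero mean under the policy.

```latex
% The baseline term vanishes in expectation because
% \sum_a \pi_\theta(a) = 1 for every \theta.
\mathbb{E}_{a \sim \pi_\theta}\!\left[ b \, \nabla_\theta \log \pi_\theta(a) \right]
  = b \sum_a \pi_\theta(a) \, \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}
  = b \, \nabla_\theta \sum_a \pi_\theta(a)
  = b \, \nabla_\theta 1
  = 0
```

Hence subtracting b changes the variance of the product term (R - b) ∇ log π, but not its expectation.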