An engineer training an agent with a policy gradient method notices that the learning process is unstable due to high variance in the gradient estimates. To address this, they introduce a baseline which is subtracted from the rewards. What is the expected statistical consequence of this modification?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Differential Impact of Baselines on Variance
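Introducing a baseline into a policy gradient algorithm is an effective technique for reducing the variance of the gradient estimates without changing their expected value, which in turn stabilizes the learning process. The baseline does not alter the distribution of rewards the agent collects under a given policy.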
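
The effect of a baseline on the gradient estimate can be checked numerically. The following is a minimal sketch, not a reference implementation: the two-action bandit, the reward values 10 and 11, the baseline of 10.5, and the function name gradient_samples are all assumptions chosen for illustration. It draws per-sample REINFORCE gradient estimates with and without a constant baseline and compares their mean and variance.

```python
import numpy as np

def gradient_samples(theta, baseline, n_samples=100_000, seed=0):
    """Per-sample REINFORCE gradient estimates for a two-action bandit.

    Hypothetical setup: action 1 is chosen with probability sigmoid(theta);
    action 0 pays reward 10 and action 1 pays reward 11.
    """
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))           # probability of action 1
    actions = (rng.random(n_samples) < p).astype(float)
    rewards = np.where(actions == 1.0, 11.0, 10.0)
    score = actions - p                        # d/dtheta log pi(action)
    # The baseline is a constant, so it does not depend on the sampled action.
    return score * (rewards - baseline)

no_baseline = gradient_samples(theta=0.0, baseline=0.0)
with_baseline = gradient_samples(theta=0.0, baseline=10.5)

print("mean:    ", no_baseline.mean(), with_baseline.mean())
print("variance:", no_baseline.var(), with_baseline.var())
```

In this toy setting the true gradient of the expected reward at theta = 0 is p(1 - p)(11 - 10) = 0.25. Both sample means should land near that value, while the variance of the baselined estimate should be far smaller. The sketch relies on the baseline being independent of the sampled action; a baseline that depends on the action would in general bias the estimate.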