Analyzing the Impact of a State-Value Baseline
In a policy gradient algorithm, the state-value function is often used as a baseline to reduce the variance of gradient estimates. Explain how subtracting the state value (the expected future reward from the current state) from the observed return helps to distinguish a 'good' action taken in a 'bad' state from a 'mediocre' action taken in a 'good' state. How does this distinction lead to a more stable learning signal?
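For concreteness, here is a minimal numeric sketch in plain Python of the advantage A(s, a) = G - V(s), the observed return minus the state value, for the two situations named above. All reward and state-value numbers are hypothetical, chosen only for illustration:

    # Hypothetical numbers for illustration; advantage = return G minus state value V(s).

    # A 'good' action in a 'bad' state: the state promises little (V = -5),
    # but the action does better than expected (G = -1).
    V_bad_state = -5.0
    G_good_action = -1.0
    advantage_good_in_bad = G_good_action - V_bad_state            # +4.0 -> reinforce

    # A 'mediocre' action in a 'good' state: the state promises a lot (V = +10),
    # but the action underdelivers (G = +6).
    V_good_state = 10.0
    G_mediocre_action = 6.0
    advantage_mediocre_in_good = G_mediocre_action - V_good_state  # -4.0 -> discourage

    print(advantage_good_in_bad, advantage_mediocre_in_good)       # 4.0 -4.0

Without the baseline, the raw returns (-1 versus +6) would push the policy toward the mediocre action; subtracting V(s) credits each action relative to what its state already promised.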
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Advantage Function Definition
In a reinforcement learning algorithm, a baseline is subtracted from the total reward to stabilize the learning process. Consider two different baseline strategies:
Strategy 1: Use a single, fixed value for the baseline, such as the average total reward calculated over many past episodes.
Strategy 2: Use a dynamic value for the baseline that is equal to the expected future reward from the agent's current state.
Why is Strategy 2 generally more effective than Strategy 1 at reducing the variance of the policy updates? (A toy numeric sketch contrasting the two strategies appears at the end of this section.)
Evaluating Actions with a State-Value Baseline
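The baseline-strategy question quoted above (Strategy 1 versus Strategy 2) lends itself to a quick empirical check. Below is a minimal sketch in plain Python; the two-state toy environment, its reward scales, and the noise level are all assumptions made for illustration, and the true state values are taken as known rather than learned. It compares the variance of the centered learning signal under the two strategies:

    # Toy two-state setup with hypothetical numbers: compare the variance of the
    # centered signal (G - baseline) under a fixed vs. a state-value baseline.
    import random

    random.seed(0)
    STATE_VALUE = {"low": 1.0, "high": 10.0}   # assumed-known expected return per state

    def sample_return(state):
        # Observed return = expected return for the state plus unit-variance noise.
        return STATE_VALUE[state] + random.gauss(0.0, 1.0)

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    fixed_baseline = sum(STATE_VALUE.values()) / 2   # Strategy 1: one global average
    signals_fixed, signals_state = [], []
    for _ in range(10_000):
        s = random.choice(list(STATE_VALUE))
        G = sample_return(s)
        signals_fixed.append(G - fixed_baseline)     # Strategy 1: fixed baseline
        signals_state.append(G - STATE_VALUE[s])     # Strategy 2: V(s) baseline

    print("fixed baseline variance:      ", variance(signals_fixed))  # ~ noise + state spread
    print("state-value baseline variance:", variance(signals_state))  # ~ noise only

Because the fixed baseline cannot track which state the agent is in, the gap between the two states leaks into the centered signal; the state-dependent baseline removes that gap and leaves only the per-state noise.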