1Cademy - In a reinforcement learning algorithm, a baseline is subtracted from the total reward to stabilize the learning process. Consider two different baseline strategies: Strategy 1: Use a single, fixed value for the baseline, such as the average total reward calculated over many past episodes. Strategy 2: Use a dynamic value for the baseline that is equal to the expected future reward from the agents *current state*. Why is Strategy 2 generally more effective at reducing the variance of the policy updates compared to Strategy 1?

Learn Before

State-Value Function as a Baseline

Multiple Choice

In a reinforcement learning algorithm, a baseline is subtracted from the total reward to stabilize the learning process. Consider two different baseline strategies:

Strategy 1: Use a single, fixed value for the baseline, such as the average total reward calculated over many past episodes. Strategy 2: Use a dynamic value for the baseline that is equal to the expected future reward from the agent's current state.

Why is Strategy 2 generally more effective at reducing the variance of the policy updates compared to Strategy 1?

Updated 2025-10-01

Contributors are:

Who are from:

Learn Before

Related