Learn Before
Multiple Choice

In a reinforcement learning algorithm, a baseline is subtracted from the total reward to stabilize the learning process. Consider two different baseline strategies:

Strategy 1: Use a single, fixed value for the baseline, such as the average total reward calculated over many past episodes. Strategy 2: Use a dynamic value for the baseline that is equal to the expected future reward from the agent's current state.

Why is Strategy 2 generally more effective at reducing the variance of the policy updates compared to Strategy 1?

0

1

Updated 2025-10-01

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science