Short Answer

Benefit of a Baseline in a Positive-Reward Environment

A reinforcement learning agent is being trained in an environment where the total reward for any complete episode is always a large positive number, ranging from +500 to +1000. An engineer decides to modify the learning algorithm by subtracting a baseline value of 750 (the average reward) from the total reward before updating the policy. Explain why this modification is likely to improve the stability of the training process, even though all rewards are already positive.

0

1

Updated 2025-10-03

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science