Stabilizing Policy Gradient Learning in a High-Variance Environment
An agent is being trained with a policy gradient method to navigate a maze in which the final reward varies significantly due to random bonus items: two runs of the same path might yield total returns of +10 and +100. Learning is observed to be very unstable, and the agent's performance fluctuates wildly between training iterations. The current policy update for an action a_t taken in state s_t is proportional to ∇ log π(a_t|s_t) · G_t, where G_t is the return (total reward) from time step t onward. Analyze the likely cause of the learning instability in this scenario. Then propose a specific modification to the term that multiplies ∇ log π(a_t|s_t) in the update rule to mitigate this issue, and justify why your proposed modification would lead to more stable learning.
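As a quick numerical aside (not part of the question itself), the sketch below simulates a hypothetical one-state, two-action Bernoulli policy in NumPy. The +10/+100 values mirror the bonus items above, while theta, the sample count, and the constant baseline b are illustrative assumptions; a minimal sketch, not a full REINFORCE implementation. It shows that subtracting a baseline from G_t leaves the expected gradient unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-state, two-action policy: pi(a=1|s) = sigmoid(theta).
theta = 0.3
p = sigmoid(theta)
n = 200_000

a = (rng.random(n) < p).astype(float)        # sample actions from pi
score = a - p                                # d/dtheta log pi(a|s) for a Bernoulli policy
bonus = rng.choice([10.0, 100.0], size=n)    # random bonus items, independent of the action
G = a + bonus                                # return: +1 for action 1, plus the bonus

b = G.mean()                                 # constant baseline, e.g. a value estimate V(s)

g_vanilla = score * G                        # grad log pi(a_t|s_t) * G_t
g_baselined = score * (G - b)                # grad log pi(a_t|s_t) * (G_t - b)

# E[score] = 0, so subtracting b leaves the expected gradient unchanged
# while shrinking the spread of the per-sample estimates.
print(f"vanilla:   mean={g_vanilla.mean():+.4f}  var={g_vanilla.var():9.2f}")
print(f"baselined: mean={g_baselined.mean():+.4f}  var={g_baselined.var():9.2f}")
```

On a typical run both estimators report a mean near p(1 − p) ≈ 0.24, while the baselined variance is roughly 2 to 3 times smaller; the gain comes from removing the large common offset (here about +55) that the bonus adds to every return, which is exactly what replacing G_t with an advantage such as G_t − V(s_t) does in general.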
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Stabilizing Policy Gradient Learning in a High-Variance Environment
A reinforcement learning agent is being trained using a policy gradient method. During training, the agent's performance is highly erratic, and the estimated gradients for policy updates have very high variance. Which of the following changes to the gradient estimation process is most directly aimed at stabilizing learning by reducing this variance?
Rationale for Using the Advantage Function