Stabilizing Policy Gradient Learning in a High-Variance Environment
An agent is being trained with a policy gradient method to navigate a maze in which the final reward varies significantly due to random bonus items: two runs of the same path might yield total returns of +10 and +100. Learning is observed to be very unstable, and the agent's performance fluctuates wildly between training iterations. The current policy update for an action a_t taken in state s_t is proportional to ∇ log π(a_t|s_t) · G_t, where G_t is the return (total reward) from time step t onward. Analyze the likely cause of the learning instability in this scenario. Then propose a specific modification to the term that multiplies ∇ log π(a_t|s_t) in the update rule to mitigate this issue, and justify why your proposed modification would lead to more stable learning.
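As a quick numerical aside (not part of the question itself), the sketch below simulates a hypothetical one-state, two-action Bernoulli policy in NumPy. The +10/+100 values mirror the bonus items above, while theta, the sample count, and the constant baseline b are illustrative assumptions; a minimal sketch, not a full REINFORCE implementation. It shows that subtracting a baseline from G_t leaves the expected gradient unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-state, two-action policy: pi(a=1|s) = sigmoid(theta).
theta = 0.3
p = sigmoid(theta)
n = 200_000

a = (rng.random(n) < p).astype(float)        # sample actions from pi
score = a - p                                # d/dtheta log pi(a|s) for a Bernoulli policy
bonus = rng.choice([10.0, 100.0], size=n)    # random bonus items, independent of the action
G = a + bonus                                # return: +1 for action 1, plus the bonus

b = G.mean()                                 # constant baseline, e.g. a value estimate V(s)

g_vanilla = score * G                        # grad log pi(a_t|s_t) * G_t
g_baselined = score * (G - b)                # grad log pi(a_t|s_t) * (G_t - b)

# E[score] = 0, so subtracting b leaves the expected gradient unchanged
# while shrinking the spread of the per-sample estimates.
print(f"vanilla:   mean={g_vanilla.mean():+.4f}  var={g_vanilla.var():9.2f}")
print(f"baselined: mean={g_baselined.mean():+.4f}  var={g_baselined.var():9.2f}")
```

On a typical run both estimators report a mean near p(1 − p) ≈ 0.24, while the baselined variance is roughly 2 to 3 times smaller; the gain comes from removing the large common offset (here about +55) that the bonus adds to every return, which is exactly what replacing G_t with an advantage such as G_t − V(s_t) does in general.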
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Stabilizing Policy Gradient Learning in a High-Variance Environment
A reinforcement learning agent is being trained using a policy gradient method. During training, the agent's performance is highly erratic, and the estimated gradients for policy updates have very high variance. Which of the following changes to the gradient estimation process is most directly aimed at stabilizing learning by reducing this variance?
Rationale for Using the Advantage Function