Case Study

Stabilizing Policy Gradient Learning in a High-Variance Environment

An agent is being trained using a policy gradient method to navigate a maze where the final reward can vary significantly due to random bonus items. For example, two identical paths taken by the agent might result in total rewards of +10 and +100, respectively. The learning process is observed to be very unstable: the agent's performance fluctuates wildly between training iterations. The current policy update rule for an action a_t taken in state s_t is proportional to log π(a_t|s_t) * G_t, where G_t is the total reward from time step t onward.

Analyze the likely cause of the learning instability described in this scenario. Then, propose a specific modification to the term that multiplies log π(a_t|s_t) in the update rule to mitigate this issue, and justify why your proposed modification would lead to more stable learning.
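The standard remedy the question points toward is subtracting a baseline, i.e. multiplying log π(a_t|s_t) by G_t - b instead of G_t. The sketch below (hypothetical, not part of the case study) simulates the +10/+100 reward setting and compares the variance of the two gradient multipliers; the score direction is modeled as a zero-mean ±1 coin flip, which is what makes a constant baseline bias-free.

```python
# Hypothetical illustration: why A_t = G_t - b has lower variance than G_t
# as the multiplier of the score function in a REINFORCE-style update.
import random
import statistics

random.seed(0)

def sample_return():
    # Identical paths yield +10 or +100 at random, as in the scenario.
    return random.choice([10.0, 100.0])

def sample_score():
    # Stand-in for the score-function direction; its mean is zero, so
    # subtracting any constant baseline does not bias the gradient estimate.
    return random.choice([-1.0, 1.0])

N = 100_000
samples = [(sample_score(), sample_return()) for _ in range(N)]

# A simple baseline choice: the empirical mean return.
baseline = statistics.mean(g for _, g in samples)

raw_grads = [s * g for s, g in samples]               # multiplier G_t
adj_grads = [s * (g - baseline) for s, g in samples]  # multiplier G_t - b

# Both estimators share the same (near-zero) expectation here, but the
# baseline-adjusted one has markedly lower variance.
print(statistics.pvariance(raw_grads) > statistics.pvariance(adj_grads))  # True
```

With returns of +10 and +100, the raw multiplier's variance is driven by E[G²] ≈ 5050, while centering at the mean return (55) shrinks it to Var(G) ≈ 2025, which is why baseline subtraction stabilizes the updates without biasing them.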

Updated 2025-10-01

Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science