Learn Before
Diagnosing Unstable Reinforcement Learning Training
Given the following training scenario, identify the most probable underlying issue with the gradient estimation process and explain why it leads to the observed unstable performance.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Baseline Method for Policy Gradient Variance Reduction
Total Reward (Return)
An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:
- Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
- Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.
Which of the following statements best analyzes the impact of these reward distributions on the policy update?
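The contrast the question probes can be sketched numerically: in a basic policy gradient method (REINFORCE), every step's log-probability gradient is weighted by the trajectory's total return, so trajectories A and B produce identical per-step weights even though their reward timing differs. A minimal NumPy illustration, assuming the reward sequences stated above; the reward-to-go variant is shown only for contrast and is not part of the question:

```python
import numpy as np

# Hypothetical reward sequences mirroring the two trajectories above.
rewards_a = np.array([1.0] * 10)          # +1 at each of 10 steps
rewards_b = np.array([0.0] * 9 + [10.0])  # all reward at the final step

# Total-return weighting: every action in a trajectory receives the
# same scalar weight, the trajectory's total return.
weights_a = np.full(10, rewards_a.sum())  # [10., 10., ..., 10.]
weights_b = np.full(10, rewards_b.sum())  # [10., 10., ..., 10.]

# The two trajectories are indistinguishable to the update rule:
print(np.array_equal(weights_a, weights_b))  # True

# Reward-to-go weighting (a common variance-reduction refinement):
# each action is credited only with the rewards that follow it.
rtg_a = np.cumsum(rewards_a[::-1])[::-1]  # [10., 9., ..., 1.]
rtg_b = np.cumsum(rewards_b[::-1])[::-1]  # [10., 10., ..., 10.]
print(rtg_a)
print(rtg_b)
```

Under total-return weighting the early, zero-reward actions of trajectory B are reinforced exactly as strongly as trajectory A's consistently rewarded ones; reward-to-go weighting is one way the credit assignment differs once reward timing is taken into account.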
Diagnosing Unstable Reinforcement Learning Training
True or False: In a basic policy gradient method, if an agent completes a trajectory with a high positive total reward, the learning algorithm will reinforce every action taken during that trajectory, even those that were suboptimal or did not directly contribute to the final outcome.
Impact of Reward Scale Variation on Policy Gradient Variance