Learn Before
Impact of Reward Scale Variation on Policy Gradient Variance
A significant reason for the high variance in policy gradient methods is that rewards can fluctuate drastically across different steps. For example, if a reward model provides small positive rewards for good actions (such as +1) but imposes massive penalties for poor actions (such as -100), the overall sequence might yield a very low total reward even if it contains many good actions. This disparity obscures the contribution of individual good actions.
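To make this concrete, below is a minimal numpy sketch (illustrative code, not from the course) that estimates the variance of total returns under a skewed reward scale versus a more balanced one. Since the total return multiplies every gradient term in vanilla REINFORCE, its variance directly drives gradient variance. The 10% bad-action rate and the 10-step trajectory length are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_returns(good_r, bad_r, p_bad=0.1, steps=10, n=10_000):
    """Sample total returns for n trajectories in which each step is
    'bad' with probability p_bad and 'good' otherwise."""
    bad = rng.random((n, steps)) < p_bad
    rewards = np.where(bad, bad_r, good_r)
    return rewards.sum(axis=1)

# Same task, two reward scales: tiny bonus / huge penalty vs. balanced.
skewed = sample_returns(good_r=+1.0, bad_r=-100.0)
balanced = sample_returns(good_r=+1.0, bad_r=-2.0)

# The return scales every grad-log-prob term in REINFORCE, so a large
# spread in returns translates directly into noisy gradient estimates.
print(f"skewed   mean={skewed.mean():8.2f}  var={skewed.var():10.2f}")
print(f"balanced mean={balanced.mean():8.2f}  var={balanced.var():10.2f}")
```

Running this, the skewed scheme's return variance is several orders of magnitude larger than the balanced scheme's, even though both trajectories contain mostly good actions.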
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Baseline Method for Policy Gradient Variance Reduction
Total Reward (Return)
An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:
- Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
- Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.
Which of the following statements best analyzes the impact of these reward distributions on the policy update?
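As an illustrative sketch of the comparison above (assumed numpy code, not part of the question), the snippet below contrasts the per-step gradient weights under a whole-trajectory return, which cannot tell Trajectories A and B apart, with reward-to-go weights, which can:

```python
import numpy as np

traj_a = np.array([1.0] * 10)            # +1 at every step
traj_b = np.array([0.0] * 9 + [10.0])    # all reward at the final step

for name, rewards in [("A", traj_a), ("B", traj_b)]:
    total = rewards.sum()
    # Vanilla REINFORCE: every log-prob term is scaled by the same
    # scalar return, so A and B yield identical per-step weights.
    whole_traj_weights = np.full_like(rewards, total)
    # Reward-to-go: step t is credited only with rewards from t onward.
    # A's credit shrinks over time; B's steps all precede the final +10,
    # so the two trajectories now produce different weight patterns.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    print(name, "total:", total)
    print("  whole-trajectory weights:", whole_traj_weights)
    print("  reward-to-go weights:    ", reward_to_go)
```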
Diagnosing Unstable Reinforcement Learning Training
True or False: In a basic policy gradient method, if an agent completes a trajectory with a high positive total reward, the learning algorithm will reinforce every action taken during that trajectory, even those that were suboptimal or did not directly contribute to the final outcome.
Impact of Reward Scale Variation on Policy Gradient Variance
Learn After
Analyzing Training Instability from Reward Design
An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
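One rough way to compare the two schemes numerically is sketched below (hedged illustration: the outcome probabilities of 60% polite, 35% helpful, and 5% rude are assumed, not given in the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample response outcomes: 0 = polite, 1 = helpful, 2 = rude.
outcomes = rng.choice(3, size=100_000, p=[0.60, 0.35, 0.05])

scheme_a = np.array([+1.0, +2.0, -100.0])[outcomes]
scheme_b = np.array([+5.0, +10.0, -15.0])[outcomes]

# A rare -100 penalty dominates Scheme A's spread; Scheme B keeps all
# rewards on a comparable scale, giving a steadier gradient signal.
print(f"Scheme A: mean={scheme_a.mean():6.2f}  std={scheme_a.std():6.2f}")
print(f"Scheme B: mean={scheme_b.mean():6.2f}  std={scheme_b.std():6.2f}")
```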
Critiquing a Reward Function for Maze Navigation