Learn Before
Analysis of Policy Update Mechanisms
An engineer is training two reinforcement learning agents, Agent X and Agent Y, on the same complex task. Both agents use a policy gradient approach, but with different update rules for actions taken within a trajectory. After running several training sessions, the engineer observes that Agent Y learns a successful policy much faster and more consistently than Agent X. The variance of the gradient updates for Agent Y is also significantly lower.
- Agent X's Update Rule: For each action taken at time step t, the policy update is weighted by the sum of all rewards from the entire trajectory (from t = 1 to T).
- Agent Y's Update Rule: For each action taken at time step t, the policy update is weighted by the sum of rewards from that time step onward (from t to T), minus an estimate of the average reward typically received from the current state.
Based on this information, identify and explain the two distinct principles incorporated into Agent Y's update rule that contribute to its superior performance and lower variance compared to Agent X.
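For concreteness, a minimal sketch contrasting the two weighting schemes (the function names and the constant baseline of 5 are illustrative assumptions, not taken from the card):

```python
# Minimal sketch contrasting the two weighting schemes.

def agent_x_weights(rewards):
    # Agent X: every action in the trajectory is weighted by the
    # full-trajectory return, regardless of when the action occurred.
    total_return = sum(rewards)
    return [total_return] * len(rewards)

def agent_y_weights(rewards, baseline):
    # Agent Y: each action is weighted by the reward-to-go (rewards
    # from its own step onward) minus a per-state baseline estimate.
    return [sum(rewards[t:]) - baseline[t] for t in range(len(rewards))]

rewards = [-1, -1, -1, 10]                 # r_1 .. r_4
baseline = [5, 5, 5, 5]                    # assumed constant b(s_t) = 5
print(agent_x_weights(rewards))            # [7, 7, 7, 7]
print(agent_y_weights(rewards, baseline))  # [2, 3, 4, 5]
```

Note that Agent Y's weights differ per time step: rewards earned before an action drop out of that action's gradient weight, and the baseline removes a state-dependent offset.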
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Policy Update Mechanisms
An agent completes an episode with the following sequence of rewards:
r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t = 2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t = 2?
Stabilizing Policy Gradient Training
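For the first related question above, a worked computation under the stated reward-to-go-with-baseline rule: the reward-to-go from t = 2 is r_2 + r_3 + r_4 = -1 + (-1) + 10 = 8, so the term multiplying the gradient of the log-probability is 8 - b(s_2) = 8 - 5 = 3.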