Multiple Choice

An agent is trained using a policy gradient method where the policy is updated based on the total reward of an entire trajectory. Consider two different trajectories that result in the same total reward:

  • Trajectory A: The agent receives a small, consistent reward of +1 at each of 10 steps, for a total reward of +10.
  • Trajectory B: The agent receives a reward of 0 for the first 9 steps and a large reward of +10 at the final step, for a total reward of +10.

Which of the following statements best analyzes the impact of these reward distributions on the policy update?
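To make the analysis concrete, here is a minimal sketch (hypothetical, not part of the original question) of the per-step weights a REINFORCE-style update would assign under two common formulations. With the plain total-trajectory-reward formulation described in the stem, every log-probability term in both trajectories is scaled by the same +10, so the two reward distributions produce indistinguishable updates; a reward-to-go formulation, by contrast, assigns different per-step credit.

```python
def total_reward_weights(rewards):
    """Plain REINFORCE: every step's log-prob is weighted by the full return."""
    g = sum(rewards)
    return [g] * len(rewards)

def reward_to_go_weights(rewards):
    """Reward-to-go: each step is weighted only by rewards from that step on."""
    weights = []
    remaining = sum(rewards)
    for r in rewards:
        weights.append(remaining)
        remaining -= r
    return weights

traj_a = [1] * 10        # Trajectory A: +1 at each of 10 steps
traj_b = [0] * 9 + [10]  # Trajectory B: +10 only at the final step

# Total-reward weighting cannot distinguish the two trajectories:
print(total_reward_weights(traj_a))  # [10, 10, ..., 10]
print(total_reward_weights(traj_b))  # [10, 10, ..., 10]

# Reward-to-go weighting assigns different per-step credit:
print(reward_to_go_weights(traj_a))  # [10, 9, 8, ..., 1]
print(reward_to_go_weights(traj_b))  # [10, 10, ..., 10]
```

Both function names are illustrative; the point is that scaling every step by the trajectory total erases any information about *when* the reward arrived.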

Updated 2025-09-26

Tags

  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science