Learn Before
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
Consider the formula for the policy gradient estimate with a baseline:

∇θ J(θ) = E_τ [ Σ_t ∇θ log π_θ(a_t | s_t) · (r_t − b) ]

According to this formula, the baseline value b is subtracted from the reward r_t at each individual timestep t within a trajectory to reduce variance. Applied to the trajectories above, Trajectory A's centered reward is +20 − 12 = +8, so its actions are reinforced, while Trajectory B's is +5 − 12 = −7, so its actions are discouraged.
Stabilizing Policy Gradient Training
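The baseline subtraction described above can be sketched with the numbers from the question. This is an illustrative snippet, not code from the source; the variable names are made up:

```python
# Illustrative sketch: how subtracting a baseline changes the sign and
# magnitude of REINFORCE-style policy updates for two trajectories.

baseline = 12.0
trajectory_rewards = {"A": 20.0, "B": 5.0}

for name, total_reward in trajectory_rewards.items():
    # The centered reward (advantage) scales the grad-log-prob terms
    # in the policy gradient estimate.
    advantage = total_reward - baseline
    direction = "reinforced" if advantage > 0 else "discouraged"
    print(f"Trajectory {name}: advantage = {advantage:+.1f}, actions {direction}")
```

A positive advantage (+8 for A) increases the probability of that trajectory's actions; a negative one (−7 for B) decreases it, even though B's raw reward was positive.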