Multiple Choice

An engineer is training two reinforcement learning agents (Agent A and Agent B) on the same task using a policy gradient method. The environment has a wide range of possible total rewards, from highly negative to highly positive. Agent A's learning algorithm directly uses the total reward received after each episode to update its policy. Agent B's algorithm first subtracts a constant value (equal to the average total reward observed so far) from the total reward before using it for the update. What is the most likely difference in the training process between Agent A and Agent B?
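The scenario describes a standard baseline in policy gradient methods: subtracting a constant (here, the running average return) from the return leaves the expected gradient unchanged but can greatly reduce its variance, so Agent B's updates are typically more stable. A minimal sketch of this effect, using a hypothetical one-parameter bandit where the score (gradient of the log-probability) is ±1 and returns sit far from zero, as in the question's wide reward range:

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: score is +1 or -1 depending on the sampled
# action; returns have a large offset and modest noise.
samples = 10_000
scores, returns = [], []
for _ in range(samples):
    scores.append(random.choice([+1.0, -1.0]))
    returns.append(500.0 + random.gauss(0.0, 10.0))

baseline = statistics.fmean(returns)  # average return observed so far

# Agent A scales the score by the raw return; Agent B subtracts the
# baseline first. Both estimators have the same expectation here.
grads_a = [s * r for s, r in zip(scores, returns)]
grads_b = [s * (r - baseline) for s, r in zip(scores, returns)]

var_a = statistics.pvariance(grads_a)  # dominated by the 500.0 offset
var_b = statistics.pvariance(grads_b)  # roughly the return noise alone
print(f"Agent A gradient variance: {var_a:.1f}")
print(f"Agent B gradient variance: {var_b:.1f}")
```

Running this, Agent A's gradient variance is on the order of the squared return offset, while Agent B's collapses to roughly the variance of the return noise, illustrating why the baselined agent tends to train more smoothly.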


Updated 2025-09-29


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science