Learn Before
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
Consider the formula for the policy gradient estimate with a baseline:

∇θ J(θ) = E_τ [ Σ_t ∇θ log π_θ(a_t | s_t) · (r_t − b) ]

According to this formula, the baseline value b is subtracted from the reward r_t at each individual timestep t within a trajectory to reduce variance. Applied to the trajectories above, Trajectory A's centered reward is +20 − 12 = +8, so its actions are reinforced, while Trajectory B's is +5 − 12 = −7, so its actions are discouraged.
Stabilizing Policy Gradient Training
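The baseline subtraction described above can be sketched with the numbers from the question. This is an illustrative snippet, not code from the source; the variable names are made up:

```python
# Illustrative sketch: how subtracting a baseline changes the sign and
# magnitude of REINFORCE-style policy updates for two trajectories.

baseline = 12.0
trajectory_rewards = {"A": 20.0, "B": 5.0}

for name, total_reward in trajectory_rewards.items():
    # The centered reward (advantage) scales the grad-log-prob terms
    # in the policy gradient estimate.
    advantage = total_reward - baseline
    direction = "reinforced" if advantage > 0 else "discouraged"
    print(f"Trajectory {name}: advantage = {advantage:+.1f}, actions {direction}")
```

A positive advantage (+8 for A) increases the probability of that trajectory's actions; a negative one (−7 for B) decreases it, even though B's raw reward was positive.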