Learn Before
Unbiased Nature of Policy Gradient with Baseline
A crucial property of using a baseline in policy gradient methods is that it does not introduce any bias into the gradient estimate. While the baseline reduces the variance of the gradient, its expected value remains unchanged. This ensures that, on average, the policy updates still move in the correct direction to improve the policy.
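A quick way to see both claims numerically is a Monte Carlo check on a toy problem. The sketch below uses a hypothetical two-armed bandit with a softmax policy (all names and numbers are illustrative, not from this card): it estimates the policy gradient with and without a baseline, and the two means agree while the variance drops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: softmax policy over actions {0, 1},
# deterministic reward per action (illustrative values only).
theta = np.array([0.2, -0.1])    # policy logits
rewards = np.array([1.0, 0.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_estimates(baseline, n=200_000):
    """Monte Carlo samples of ∇θ log πθ(a) * (r(a) - baseline)."""
    pi = softmax(theta)
    a = rng.choice(2, size=n, p=pi)
    # For a softmax policy, ∇θ log πθ(a) = one_hot(a) - π.
    score = np.eye(2)[a] - pi
    return score * (rewards[a] - baseline)[:, None]

g_plain = grad_estimates(baseline=0.0)
g_base = grad_estimates(baseline=0.5)   # baseline near the mean reward

# Same mean gradient (unbiased), but much lower variance with the baseline.
print(g_plain.mean(axis=0), g_base.mean(axis=0))
print(g_plain.var(axis=0).sum(), g_base.var(axis=0).sum())
```

With these illustrative numbers the total variance falls by more than an order of magnitude while the estimated mean gradient is unchanged up to sampling noise.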
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
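The arithmetic behind this question, using only the numbers it gives, can be sketched as:

```python
baseline = 12.0
returns = {"A": 20.0, "B": 5.0}

# Advantage = total reward minus baseline; its sign sets the update direction.
advantages = {k: r - baseline for k, r in returns.items()}
print(advantages)  # {'A': 8.0, 'B': -7.0}
```

Trajectory A's actions are reinforced (advantage +8), while Trajectory B's actions are discouraged (advantage −7) even though its raw reward was positive.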
Consider the formula for the policy gradient estimate with a baseline:

∇θ J(θ) ≈ E[ Σ_t ∇θ log πθ(a_t|s_t) · (r_t − b) ]

According to this formula, the baseline value b is subtracted from the reward r_t at each individual timestep t within a trajectory to reduce variance.
Stabilizing Policy Gradient Training
Learn After
Analysis of the Baseline's Effect on Policy Gradient Expectation
In a policy gradient algorithm, a common technique to stabilize learning is to subtract a calculated value from the total reward of each trajectory before computing the update. This is done to reduce the variability of the updates without altering their expected direction. Which of the following calculated values, if subtracted from the total reward, would introduce an incorrect bias and potentially lead the policy updates in the wrong direction on average?
In the mathematical proof demonstrating that a state-dependent baseline b(s_t) does not introduce bias to the policy gradient estimate, the expected value of the baseline-related term, E[ (∇θ log πθ(a_t|s_t)) · b(s_t) ], evaluates to zero. Which of the following is the fundamental reason for this outcome?
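The derivation this question refers to can be written out for discrete actions (the continuous case replaces the sum with an integral); the key step is that the probabilities sum to one, so the gradient of their sum is zero:

```latex
\mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]
= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta 1
= 0 .
```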