Learn Before
Rationale for Using a Baseline in Policy Gradients
In the context of training a policy, consider a scenario where all rewards for a set of actions are positive (e.g., ranging from +10 to +20). Without a baseline, the term weighting the gradient update is always positive. Explain how subtracting a baseline (e.g., the average reward of +15) from the total reward helps to create a more stable and effective learning signal, specifically addressing how it reduces the variance of the gradient estimate.
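A quick way to see the variance effect is a toy Monte Carlo sketch. The snippet below is a minimal illustration, not part of the original card: it assumes a simplified model (an assumption for this sketch) in which the score term ∇_θ log π_θ(a_t|s_t) is a zero-mean scalar drawn independently of the reward, with rewards uniform on [+10, +20] as in the prompt's scenario. Under those assumptions, subtracting a constant baseline leaves the expected gradient unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy model (assumption, not from the card): the score term grad log pi(a|s)
# is a zero-mean scalar independent of the reward, and rewards are uniform
# on [+10, +20] as in the prompt's scenario.
score = rng.normal(0.0, 1.0, size=n)
reward = rng.uniform(10.0, 20.0, size=n)
baseline = reward.mean()  # approximately the +15 average from the prompt

g_raw = score * reward               # per-sample gradient term, no baseline
g_adj = score * (reward - baseline)  # per-sample term, baseline subtracted

# Both estimators have (near-)zero mean: subtracting a constant baseline
# does not change the expected update direction...
print("means:    ", g_raw.mean(), g_adj.mean())
# ...but the baselined estimator has far lower variance.
print("variances:", g_raw.var(), g_adj.var())
```

In this toy model the raw estimator's variance is driven by E[R²] ≈ 233, while subtracting the mean leaves only Var(R) ≈ 8.3, a roughly 28× reduction; that gap is the intuition the card asks you to articulate.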
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Decomposition of Reward Sum for Causality in Policy Gradients
In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?

In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep t is an estimate of the value of the specific action a_t taken in state s_t. What is the primary theoretical problem with this choice of baseline?
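Both related questions turn on the same standard identity from the policy gradient derivation: a baseline that depends only on the state factors out of the expectation over actions, and the remaining score term sums to zero. A worked sketch of that identity:

```latex
% A state-dependent baseline b(s) adds zero bias: it factors out of the
% expectation over actions, and the score term sums to zero.
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\bigr]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

If the baseline instead depends on the sampled action a_t, it cannot be factored out of the sum over actions, so the subtracted term no longer vanishes in expectation and the gradient estimate becomes biased; that is the theoretical problem the second question points at. A valid choice, by contrast, is any function of the state alone, such as the average reward or a learned value estimate of s_t.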