Learn Before
In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep t is an estimate of the value of the specific action a_t taken in state s_t. What is the primary theoretical problem with this choice of baseline?
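The theoretical problem the question points at can be made precise with a short derivation (standard score-function argument, not part of this card): a baseline is harmless only if it is independent of the sampled action, because then its contribution to the expected gradient vanishes.

```latex
% For a baseline b(s_t) that does NOT depend on the action a_t:
\mathbb{E}_{a_t \sim \pi_\theta(\cdot\mid s_t)}
  \big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  = b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t)
  = b(s_t)\, \nabla_\theta \underbrace{\sum_{a} \pi_\theta(a \mid s_t)}_{=\,1}
  = 0.
% If instead b depends on a_t (e.g. b = Q(s_t, a_t)), it cannot be pulled
% outside the expectation, the term is generally nonzero, and subtracting it
% biases the gradient estimate.
```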
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Decomposition of Reward Sum for Causality in Policy Gradients
In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?
Rationale for Using a Baseline in Policy Gradients
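The effect of the baseline choice can be checked numerically. Below is a minimal sketch (my own illustration, not from this card) using a hypothetical 3-armed bandit with a softmax policy: the exact expected gradient is computed by enumeration, once with a state-value baseline V(s) (unbiased) and once with the action-dependent baseline b = Q(s, a_t) = R(a_t), which cancels the reward term and destroys the learning signal.

```python
import numpy as np

# Hypothetical 3-armed bandit: softmax policy over logits theta,
# deterministic reward R(a) per action.
theta = np.array([0.2, -0.1, 0.5])
rewards = np.array([1.0, 2.0, 3.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

def grad_log_pi(a):
    # For a softmax policy: grad_theta log pi(a) = e_a - pi
    g = -pi.copy()
    g[a] += 1.0
    return g

# True policy gradient: E_a[grad log pi(a) * R(a)], computed exactly.
true_grad = sum(pi[a] * grad_log_pi(a) * rewards[a] for a in range(3))

# State-dependent baseline b = V(s) = E[R]: expectation is unchanged,
# because E_a[grad log pi(a)] = 0.
b_const = rewards @ pi
grad_const = sum(pi[a] * grad_log_pi(a) * (rewards[a] - b_const)
                 for a in range(3))

# Action-dependent baseline b(a_t) = Q(s, a_t) = R(a_t): every term
# (R(a) - b(a)) is zero, so the expected update collapses to zero.
grad_action = sum(pi[a] * grad_log_pi(a) * (rewards[a] - rewards[a])
                  for a in range(3))

print(np.allclose(true_grad, grad_const))  # state baseline stays unbiased
print(np.allclose(grad_action, 0.0))       # Q-baseline erases the gradient
```

Both checks print `True`: subtracting V(s) leaves the expected gradient exactly equal to the true gradient, while subtracting the value of the specific action taken zeroes out the signal the policy needs to improve.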