Policy Gradient with Baseline
To reduce the variance of the policy gradient estimator, a baseline term, b, can be subtracted from the total trajectory reward, R(τ). This modification does not introduce bias into the gradient estimate as long as the baseline does not depend on the action a_t, because the expected value of ∇_θ log Pr_θ(τ) under the policy is zero, so the subtracted term vanishes in expectation. The resulting formula for the policy gradient is: ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) (R(τ) - b). A common choice for the baseline is an estimate of the state-value function, V(s_t).
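The effect of a constant baseline can be checked exactly on a toy problem. The sketch below is only an illustration under stated assumptions: a single-state, three-action softmax policy, made-up rewards of 95, 100 and 105, and a hypothetical helper named grad_stats. It enumerates every action instead of sampling, so the mean and variance of the single-sample estimator are computed without noise.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_stats(logits, rewards, baseline):
    """Exact mean and total variance of the single-sample estimator
    g(a) = grad_theta log pi(a) * (R(a) - baseline) for a softmax policy.
    (Hypothetical helper; the toy setup is an assumption, not from the source.)"""
    pi = softmax(logits)
    n = len(pi)
    samples = []
    for a in range(n):
        # Score function of a softmax policy: onehot(a) - pi.
        score = [(1.0 if k == a else 0.0) - pi[k] for k in range(n)]
        samples.append([s * (rewards[a] - baseline) for s in score])
    # Expectation over actions drawn from pi.
    mean = [sum(pi[a] * samples[a][k] for a in range(n)) for k in range(n)]
    # Sum of per-component variances.
    var = sum(
        pi[a] * sum((samples[a][k] - mean[k]) ** 2 for k in range(n))
        for a in range(n)
    )
    return mean, var

logits = [0.0, 0.0, 0.0]          # uniform policy (assumed)
rewards = [95.0, 100.0, 105.0]    # assumed rewards, one per action

mean_plain, var_plain = grad_stats(logits, rewards, baseline=0.0)
mean_base, var_base = grad_stats(logits, rewards, baseline=100.0)

print("mean without baseline:", mean_plain)
print("mean with baseline:   ", mean_base)
print("variance without / with baseline:", var_plain, var_base)
```

In this toy setting the two means agree up to floating-point error, while the variance with the baseline is far smaller, which is the behaviour the text describes. A trajectory-level problem would need sampled rollouts rather than exact enumeration.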

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ: ∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]. In practice, this is estimated from a batch of |D| sampled trajectories using the following formula: ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ). What key assumption allows for the transition from the theoretical expectation to this practical sample mean estimator?
Policy Gradient with Baseline
Reward-to-Go
An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.
- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.
The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively. Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?
Policy Gradient Estimate with Baseline
Baseline's Role in Centering Rewards and Reducing Gradient Variance
State-Value Function as a Baseline
Baseline's Impact on Reward Variance vs. Gradient Estimate Variance
An engineer is training two reinforcement learning agents (Agent A and Agent B) on the same task using a policy gradient method. The environment has a wide range of possible total rewards, from highly negative to highly positive. Agent A's learning algorithm directly uses the total reward received after each episode to update its policy. Agent B's algorithm first subtracts a constant value (equal to the average total reward observed so far) from the total reward before using it for the update. What is the most likely difference in the training process between Agent A and Agent B?
Benefit of a Baseline in a Positive-Reward Environment
A reinforcement learning agent is being trained in a specialized environment where the total reward for any complete episode consistently falls within a narrow range of 95 to 105. The training algorithm uses a policy gradient method and incorporates a baseline by subtracting the long-term average reward (approximately 100) from each episode's total reward before performing an update. Which statement best evaluates the utility of this baseline in this specific scenario?
Learn After
Decomposition of Reward Sum for Causality in Policy Gradients
In policy gradient methods, a baseline b is subtracted from the total reward for a trajectory, R(τ), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to (∇_θ Σ_t log π_θ(a_t|s_t)) * (R(τ) - b). Which of the following would be a valid and effective choice for the baseline b?
In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep t is an estimate of the value of the specific action a_t taken in state s_t. What is the primary theoretical problem with this choice of baseline?
Rationale for Using a Baseline in Policy Gradients
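The action-dependence issue raised in the questions above can be made concrete with a small exact calculation. This is a minimal sketch under assumed conditions: a single-state, three-action softmax policy, invented rewards, and a hypothetical helper named expected_gradient. It compares the expected gradient with no baseline, with a constant baseline, and with a baseline equal to the value of the action actually taken, subtracted naively with no correction term.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_gradient(logits, rewards, baseline_fn):
    """Exact expectation over actions of
    grad_theta log pi(a) * (R(a) - baseline_fn(a)).
    (Hypothetical helper for illustration only.)"""
    pi = softmax(logits)
    n = len(pi)
    grad = [0.0] * n
    for a in range(n):
        weight = rewards[a] - baseline_fn(a)
        for k in range(n):
            # Score function of a softmax policy: onehot(a) - pi.
            score_k = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += pi[a] * score_k * weight
    return grad

logits = [0.0, 0.0, 0.0]          # uniform policy (assumed)
rewards = [95.0, 100.0, 105.0]    # assumed rewards, one per action

true_grad = expected_gradient(logits, rewards, lambda a: 0.0)
state_only = expected_gradient(logits, rewards, lambda a: 100.0)
# Baseline equal to the value of the chosen action itself.
action_dep = expected_gradient(logits, rewards, lambda a: rewards[a])

print("no baseline:              ", true_grad)
print("constant baseline:        ", state_only)
print("action-dependent baseline:", action_dep)
```

In this extreme case the action-dependent baseline cancels the reward for every sample, so the expected gradient collapses to zero and no longer matches the true gradient, whereas the constant baseline leaves the expectation unchanged.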