Policy Gradient with Reward-to-Go and Baseline
By omitting the term for past rewards, which do not contribute to the gradient, we arrive at a simplified version of the policy gradient with a baseline. This formulation isolates the future rewards (the reward-to-go) from time step t onwards:

∇_θ J(θ) ≈ ∑_{t=1}^{T} ∇_θ log π_θ(a_t|s_t) * ( (∑_{k=t}^{T} r_k) - b(s_t) )

Removing the past rewards not only simplifies the calculation but can further reduce the variance of the gradient estimate.
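As a concrete illustration, below is a minimal Python sketch of the per-timestep weights (∑_{k=t}^{T} r_k) - b(s_t) that multiply ∇_θ log π_θ(a_t|s_t). The function name reward_to_go_weights is illustrative, and a constant baseline is assumed for simplicity.

import numpy as np

def reward_to_go_weights(rewards, baseline=0.0):
    """Per-timestep policy gradient weights: (sum_{k=t}^{T} r_k) - baseline.

    rewards  : the sequence r_1 ... r_T from a single trajectory
    baseline : stands in for b(s_t); a constant here for simplicity
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse cumulative sum gives the reward-to-go at every timestep t.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - baseline

# The single-trajectory gradient estimate is then
#   sum_t weights[t] * grad_theta log pi_theta(a_t | s_t).
weights = reward_to_go_weights([-1.0, -1.0, -1.0, 10.0], baseline=5.0)
print(weights)  # [2. 3. 4. 5.]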

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory, ∑_{k=1}^{T} r_k. Including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduces a systematic error, or bias, into the resulting gradient estimate.

Policy Gradient with Reward-to-Go and Baseline
Calculating Advantage from a Trajectory
In the context of estimating the advantage of taking an action a_t in a state s_t, the formula A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is often used. What is the primary role of the reward-to-go term, ∑_{k=t}^{T} r_k, within this specific estimation?

In a given trajectory, if the calculated advantage A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is negative, it implies that the action a_t taken in state s_t led to a sequence of rewards that was worse than the average expected outcome from that state.

Policy Gradient with Reward-to-Go and Baseline
In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

∇_θ log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) - b )

Here, t is a specific timestep within the trajectory of length T, π_θ(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t (∇_θ log π_θ(a_t|s_t)) and the two components of the reward sum?

In the context of improving a policy gradient estimator, the total reward for a trajectory, ∑_{k=1}^{T} r_k, is often rewritten inside the gradient calculation for a specific timestep t as ∑_{k=1}^{t-1} r_k + ∑_{k=t}^{T} r_k. This specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
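The advantage formula referenced in the related questions above, A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t), can be sketched in the same style. The function advantage_estimates and the value estimates passed to it are hypothetical stand-ins, not part of the course material.

import numpy as np

def advantage_estimates(rewards, state_values):
    """A(s_t, a_t) = (sum_{k=t}^{T} r_k) - V(s_t) for every timestep.

    rewards      : r_1 ... r_T from one trajectory
    state_values : the corresponding estimates V(s_1) ... V(s_T)
    """
    rewards = np.asarray(rewards, dtype=float)
    state_values = np.asarray(state_values, dtype=float)
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - state_values

# A negative entry means the observed return from that state was worse than
# the value estimate V(s_t), so the corresponding action is made less likely.
print(advantage_estimates([1.0, -2.0, 0.5], [0.0, -2.0, 0.0]))  # [-0.5  0.5  0.5]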
Learn After
Analysis of Policy Update Mechanisms
An agent completes an episode with the following sequence of rewards:
r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t = 2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t = 2?

Stabilizing Policy Gradient Training
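Worked out under the reward-to-go-with-baseline rule described at the top of this page, the multiplier for the action at t = 2 is the reward-to-go from t = 2 onwards minus the baseline: (r_2 + r_3 + r_4) - b(s_2) = (-1 - 1 + 10) - 5 = 3.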