Learn Before
Decomposition of Reward Sum for Causality in Policy Gradients
To distinguish between rewards accrued before and after an action at time step \(t\), the total reward inside the gradient calculation can be decomposed. The sum is brought inside the gradient operator and split into past and future components:

\[ \sum_{k=1}^{T} r_k = \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \]

This decomposition explicitly separates the past-rewards term \( \sum_{k=1}^{t-1} r_k \), preparing the equation so this term can be safely omitted to further reduce gradient variance: by causality, the action taken at time step \(t\) cannot influence rewards that were already received before time step \(t\).
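As a quick numerical check (a minimal sketch with made-up rewards, not code from the course), the split can be verified at every timestep \(t\):

```python
# Check sum_{k=1}^{T} r_k = sum_{k=1}^{t-1} r_k + sum_{k=t}^{T} r_k at each t.
rewards = [1.0, 0.0, 2.0, 3.0, 1.0]  # hypothetical rewards r_1 .. r_T

for t in range(1, len(rewards) + 1):    # 1-indexed timesteps, as in the formula
    past = sum(rewards[: t - 1])        # sum_{k=1}^{t-1} r_k (the term dropped later)
    future = sum(rewards[t - 1:])       # sum_{k=t}^{T} r_k (the "reward-to-go")
    assert abs(past + future - sum(rewards)) < 1e-9  # the split is purely algebraic
```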

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In policy gradient methods, a baseline \(b\) is subtracted from the total reward for a trajectory, \(R(\tau)\), to reduce the variance of the gradient estimate. The update for a trajectory is proportional to \( \big( \nabla_\theta \sum_t \log \pi_\theta(a_t|s_t) \big) (R(\tau) - b) \). Which of the following would be a valid and effective choice for the baseline \(b\)?

In a policy gradient algorithm, a researcher attempts to reduce the variance of the gradient estimate by subtracting a baseline from the total reward. The proposed baseline for a given timestep \(t\) is an estimate of the value of the specific action \(a_t\) taken in state \(s_t\). What is the primary theoretical problem with this choice of baseline?

Rationale for Using a Baseline in Policy Gradients
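Following the two baseline questions above, here is a rough numerical illustration (a hedged sketch built on an assumed one-step toy problem, not taken from the course) of why subtracting an action-independent constant \(b\) is valid: the expected gradient weight is unchanged, while its variance can drop sharply when \(b\) is close to the expected return.

```python
import random

random.seed(0)

def sample_weight(b):
    # Toy one-step setup (assumed for illustration): the score term is +/-1
    # for the two actions, and the return R does not depend on the action,
    # so the true expected weight is exactly 0 for any constant b.
    score = random.choice([+1.0, -1.0])   # stand-in for grad log pi(a|s)
    R = 5.0 + random.gauss(0.0, 1.0)      # noisy return with mean 5
    return score * (R - b)

for b in (0.0, 5.0):                      # no baseline vs. b close to E[R]
    ws = [sample_weight(b) for _ in range(100_000)]
    mean = sum(ws) / len(ws)
    var = sum((w - mean) ** 2 for w in ws) / len(ws)
    print(f"b={b}: mean ~ {mean:+.3f}, variance ~ {var:.2f}")
# Both means are ~0 (the estimate stays unbiased), but the variance drops
# from roughly 26 to roughly 1 when b matches the expected return.
```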
Learn After
Policy Gradient with Reward-to-Go and Baseline
In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right) \]

Here, \(t\) is a specific timestep within the trajectory of length \(T\), \(\pi_\theta(a_t|s_t)\) is the probability of taking action \(a_t\) in state \(s_t\), \(r_k\) is the reward at timestep \(k\), and \(b\) is a constant value. Which statement best analyzes the relationship between the policy term for timestep \(t\), \(\nabla_\theta \log \pi_\theta(a_t|s_t)\), and the two components of the reward sum?

In the context of improving a policy gradient estimator, the total reward for a trajectory, \( \sum_{k=1}^{T} r_k \), is often rewritten inside the gradient calculation for a specific timestep \(t\) as \( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \). True or false: this specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
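To connect the two items above (a hedged sketch with hypothetical rewards; the variable names are my own, not the course's), the per-timestep weight multiplying \(\nabla_\theta \log \pi_\theta(a_t|s_t)\), once the past-rewards component is dropped, is the reward-to-go minus the constant baseline \(b\):

```python
# Hypothetical trajectory rewards and constant baseline (made-up numbers).
rewards = [1.0, 0.0, 2.0, 3.0]  # r_1 .. r_T
b = 1.5
T = len(rewards)

# weight_t = sum_{k=t}^{T} r_k - b : the factor that scales
# grad log pi(a_t|s_t) once the past-rewards component is omitted.
weights = [sum(rewards[t - 1:]) - b for t in range(1, T + 1)]
print(weights)  # [4.5, 3.5, 3.5, 1.5]

# In an autodiff framework one would typically build the surrogate loss
#   loss = -sum(logp[t] * weights[t] for t in range(T))
# so that its gradient matches the per-trajectory expression above.
```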