Learn Before
Derivation of Reward Decomposition in Policy Gradient with Baseline
The policy gradient with a baseline can be mathematically manipulated to separate past and future rewards, which is a key step toward applying the causality principle for variance reduction. The derivation begins with the standard policy gradient formula with a baseline. The total reward term is then distributed into the sum over timesteps, and this total reward sum is subsequently decomposed into rewards accumulated before the current timestep and rewards from the current timestep onward; the chain of manipulations is sketched below. The final decomposed expression makes the distinction between past and future rewards explicit, setting the stage for eliminating the irrelevant past rewards from the gradient calculation.
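A sketch of this derivation in one standard notation (a policy $\pi_\theta(a_t \mid s_t)$, per-step rewards $r_k$, a trajectory of length $T$, and a constant baseline $b$; the expectation over trajectories is omitted, and the exact symbols on the original card may differ):

$$
\begin{aligned}
\nabla_\theta J(\theta)
&\approx \Bigg(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg)\Bigg(\sum_{k=1}^{T} r_k - b\Bigg) \\
&= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg(\sum_{k=1}^{T} r_k - b\Bigg) \\
&= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigg(\underbrace{\sum_{k=1}^{t-1} r_k}_{\text{past rewards}} \;+\; \underbrace{\sum_{k=t}^{T} r_k}_{\text{future rewards}} \;-\; b\Bigg)
\end{aligned}
$$

The first line is the policy gradient with a baseline for a single trajectory, the second distributes the bracketed reward term into the sum over timesteps, and the third splits the total reward at timestep $t$; only the future-reward sum depends on the action taken at $t$, which is what the causality argument exploits.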

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Derivation of Reward Decomposition in Policy Gradient with Baseline
Unbiased Nature of Policy Gradient with Baseline
In a reinforcement learning task, an agent completes two distinct trajectories. Trajectory A results in a total reward of +20, and Trajectory B results in a total reward of +5. To update the agent's policy, a baseline value of +12 is subtracted from each trajectory's total reward. Based on this information, how will the policy updates derived from these two trajectories differ?
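A short worked computation with the numbers given above (the signs, not the exact magnitudes, are the point):

$$
R_A - b = 20 - 12 = +8, \qquad R_B - b = 5 - 12 = -7
$$

Because the baseline-adjusted return is positive for Trajectory A and negative for Trajectory B, the update pushes the policy toward the actions taken in A and away from those taken in B, rather than reinforcing both as the raw returns (+20 and +5, both positive) would.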
Consider the formula for the policy gradient estimate with a baseline (a standard form is sketched below, after the next related item). According to this formula, the baseline value $b$ is subtracted from the reward $r_t$ at each individual timestep $t$ within a trajectory to reduce variance.
Stabilizing Policy Gradient Training
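The formula referenced in the question above is not reproduced on this page; one commonly written form of the policy gradient estimate with a baseline (a standard convention, not necessarily the exact formula from the original card) is:

$$
\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b\big),
\qquad R(\tau) = \sum_{k=1}^{T} r_k
$$

In this form, $b$ enters once per trajectory rather than once per reward $r_t$.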
Learn After
Irrelevance of Past Rewards for Policy Gradient Calculation
Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.
Derivation: Let the policy gradient objective be:
Step 1: The reward term, which is constant with respect to the parameters $\theta$, is moved inside the derivative:
Step 2: The reward term, which is constant for a given trajectory $\tau$, is distributed inside the summation over timesteps $t$:
Step 3: The total reward sum $\sum_{k=1}^{T} r_k$ is decomposed into rewards before and after the current timestep $t$:
Which step introduces an error into the derivation?
Purpose of Reward Decomposition in Policy Gradient
A common technique to improve the stability of a policy-based learning algorithm involves rewriting its core update rule. The goal is to isolate the influence of rewards that occur after an action is taken from those that occur before. Below are four key stages of this mathematical derivation. Arrange them in the correct logical order, from the initial formulation to the final decomposed form. (Note: For simplicity, the expectation over trajectories is omitted).