Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the total reward into the components received before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.
Derivation: Start from the policy gradient objective; each step below transforms it as described (a reconstructed sketch of the equations follows Step 3).
Step 1: The reward term, which is constant with respect to the parameters $\theta$, is moved inside the derivative.
Step 2: The reward term, which is constant for a given trajectory $\tau$, is distributed inside the summation over timesteps $t$.
Step 3: The total reward sum $\sum_{k=1}^{T} r_k$ is decomposed into rewards before and after the current timestep $t$.
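A plausible reconstruction of the derivation these steps describe, written per trajectory with the expectation over trajectories and the baseline term omitted for brevity (the notation $\pi_\theta(a_t \mid s_t)$ for the policy, $r_k$ for the reward at step $k$, the horizon $T$, and the split at $k = t$ are assumptions rather than quotations from the original equations):
Objective: $\nabla_\theta J(\theta) \;=\; \Big(\sum_{k=1}^{T} r_k\Big)\,\nabla_\theta \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)$
Step 1: $\;=\; \nabla_\theta \Big[\Big(\sum_{k=1}^{T} r_k\Big) \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)\Big]$
Step 2: $\;=\; \nabla_\theta \sum_{t=1}^{T} \Big(\sum_{k=1}^{T} r_k\Big) \log \pi_\theta(a_t \mid s_t)$
Step 3: $\;=\; \nabla_\theta \sum_{t=1}^{T} \Big(\sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k\Big) \log \pi_\theta(a_t \mid s_t)$
The decomposition in Step 3 is what later enables the reward-to-go formulation, in which rewards received before timestep $t$ are dropped from the weight on $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, since in expectation they do not depend on the action taken at $t$.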
Which step introduces an error into the derivation?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
Purpose of Reward Decomposition in Policy Gradient
A common technique to improve the stability of a policy-based learning algorithm involves rewriting its core update rule. The goal is to isolate the influence of rewards that occur after an action is taken from those that occur before. Below are four key stages of this mathematical derivation. Arrange them in the correct logical order, from the initial formulation to the final decomposed form. (Note: For simplicity, the expectation over trajectories is omitted).