Irrelevance of Past Rewards for Policy Gradient Calculation
In Markov decision processes, the future is independent of the past given the present. Consequently, an action taken at time step t cannot influence the rewards received before time t. Since the rewards prior to time t are already "fixed" by the time the action is chosen, the term representing the sum of these past rewards, ∑_{k=1}^{t-1} r_k, does not contribute to the gradient and can be omitted. Eliminating this term helps to further reduce the variance of the policy gradient.
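As a sketch of why the past-reward term can be dropped (written in the notation used in the cards below, with the expectation over trajectories left implicit):

```latex
% Per-timestep policy-gradient term, with the return split at time t:
\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=1}^{T} r_k
  = \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \right)
% The past rewards do not depend on a_t, while the score function has zero
% mean under the policy:
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big]
  = \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
  = \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
  = \nabla_\theta 1 = 0
% so the past-reward part vanishes in expectation; keeping it only adds
% noise to the sampled gradient estimate.
```

Dropping a term whose expectation is zero leaves the gradient estimator unbiased while removing one source of noise.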
Related
Irrelevance of Past Rewards for Policy Gradient Calculation
An autonomous agent completes a task over four time steps. The sequence of actions and resulting rewards is as follows:
- Time t=1: Action
a_1-> Rewardr_1 = 0 - Time t=2: Action
a_2-> Rewardr_2 = 0 - Time t=3: Action
a_3-> Rewardr_3 = -1 - Time t=4: Action
a_4-> Rewardr_4 = +10
When evaluating the decision to take action
a_2at time t=2, which rewards should be considered as being potentially influenced by this specific action?- Time t=1: Action
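A small numeric sketch of this trajectory (the reward values come from the list above; the reward_to_go helper is just an illustrative name, and the convention ∑_{k=t}^{T} r_k matches the decomposition used elsewhere on this page):

```python
# Reward-to-go for the example trajectory: the action a_t can only
# influence rewards from time t onward, so each action is credited
# with the suffix sum of rewards starting at its own timestep.
rewards = [0, 0, -1, 10]                 # r_1 .. r_4 from the list above

def reward_to_go(rs):
    """Suffix sums: out[t] = r_{t+1} + ... + r_T for a 0-indexed input list."""
    out, running = [], 0
    for r in reversed(rs):
        running += r
        out.append(running)
    return out[::-1]

print(reward_to_go(rewards))             # [9, 9, 9, 10]
# The decision a_2 is weighted by r_2 + r_3 + r_4 = 0 + (-1) + 10 = 9,
# not by r_1, which was received before a_2 was chosen.
```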
Causality Principle in Policy Gradient Calculation
An agent is learning to play a video game. At time step t=5, the agent performs an action (e.g., jumping). According to the causality principle in this context, this specific action at t=5 can alter the reward that was already received at time step t=3.
Debugging a Policy Update Calculation
Consider the following mathematical derivation, which attempts to rewrite the policy gradient with a baseline. The goal is to separate the reward term into components that occurred before and after a specific action. Analyze the steps and identify which one contains a logical or mathematical error.
Derivation: Let the policy gradient objective be:
Step 1: The reward term, which is constant with respect to the parameters θ, is moved inside the derivative:
Step 2: The reward term, which is constant for a given trajectory τ, is distributed inside the summation over timesteps t:
Step 3: The total reward sum ∑_{k=1}^{T} r_k is decomposed into rewards before and after the current timestep t:
Which step introduces an error into the derivation?
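For reference, one standard, error-free form of this rewriting (a sketch in the page's notation, with the expectation over trajectories left implicit; it does not necessarily reproduce the exact expressions used in the steps above):

```latex
% Objective gradient: rewards are constant w.r.t. \theta, so they can sit
% inside or outside the derivative of the log-probabilities:
\nabla_\theta J(\theta)
  = \nabla_\theta \Big( \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) \Big)
    \sum_{k=1}^{T} r_k
% Distribute the (trajectory-constant) return across the sum over t:
  = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \sum_{k=1}^{T} r_k
% Split the return at timestep t into past and future parts:
  = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big)
```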
Purpose of Reward Decomposition in Policy Gradient
A common technique to improve the stability of a policy-based learning algorithm involves rewriting its core update rule. The goal is to isolate the influence of rewards that occur after an action is taken from those that occur before. Below are four key stages of this mathematical derivation. Arrange them in the correct logical order, from the initial formulation to the final decomposed form. (Note: For simplicity, the expectation over trajectories is omitted).
Learn After
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy
π_θ(a_t|s_t)depends on a gradient calculation. For a specific timesteptwithin a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before timetfrom those that occurred at or after timet:log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )When computing the gradient of this expression with respect to the policy parameters
θ, how does the∑_{k=1}^{t-1} r_kterm (the sum of past rewards) influence the gradient associated with the action at timet?Policy Gradient Optimization
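As a quick check, differentiating the displayed term (a sketch; for a fixed trajectory both reward sums are constants with respect to θ):

```latex
% Both reward sums are fixed numbers once the trajectory is given, so they
% simply scale the score function:
\nabla_\theta \Big[ \log \pi_\theta(a_t \mid s_t)
  \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big) \Big]
  = \Big( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k \Big)
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)
% The past-reward sum enters only as a multiplicative constant whose
% expected contribution is zero (see the sketch near the top of the page).
```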
Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). Including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduces a systematic error, or bias, into the resulting gradient estimate.
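A minimal Monte Carlo sketch for checking this claim empirically. The setup is hypothetical and not from the card: a one-parameter Bernoulli policy, an action-independent "past" reward r1, and an action-dependent reward r2. The two estimators below weight the score function by the full return and by the reward-to-go, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))         # probability of choosing action 1

n = 200_000
a = rng.random(n) < p                    # sampled actions (boolean array)
r1 = rng.normal(3.0, 5.0, size=n)        # "past" reward, independent of the action
r2 = a.astype(float)                     # reward caused by the action (1 if a = 1)

# Score function d/dtheta log pi_theta(a): (1 - p) if a = 1, else -p.
score = np.where(a, 1.0 - p, -p)

g_full = score * (r1 + r2)               # weighted by the full return
g_togo = score * r2                      # weighted by the reward-to-go only

true_grad = p * (1.0 - p)                # analytic d/dtheta E[r2] for this policy
print("analytic gradient        :", true_grad)
print("full return    mean, var :", g_full.mean(), g_full.var())
print("reward-to-go   mean, var :", g_togo.mean(), g_togo.var())
```

Comparing the printed means against the analytic gradient p(1-p) shows whether the extra past-reward term shifts the estimate on average or only changes its spread.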