Policy Gradient Optimization
In the context of calculating the gradient for a policy update at a specific time t, the full sum of rewards for an episode (∑_{k=1}^{T} r_k) is often used to weight the score function ∇_θ log π_θ(a_t|s_t). However, a more refined approach replaces the full sum of rewards with only the sum of rewards from time t onwards (∑_{k=t}^{T} r_k). Analyze this modification by explaining both the conceptual principle and the mathematical property that justify ignoring the sum of past rewards (∑_{k=1}^{t-1} r_k) in this calculation.
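A minimal sketch of the mathematical property at play, writing E for the expectation over a_t ∼ π_θ(·|s_t) and c for any factor that is already fixed once the trajectory up to time t is given (for example ∑_{k=1}^{t-1} r_k):

E[ c * ∇_θ log π_θ(a_t|s_t) ]
= c * ∑_a π_θ(a|s_t) * ∇_θ log π_θ(a|s_t)
= c * ∑_a ∇_θ π_θ(a|s_t)
= c * ∇_θ ∑_a π_θ(a|s_t)
= c * ∇_θ 1
= 0

In expectation, the past rewards therefore contribute nothing to the gradient for the action at time t; dropping them leaves the estimate unbiased while reducing its variance.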
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?
Policy Gradient Optimization
Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). Does including the rewards received before time t (∑_{k=1}^{t-1} r_k) in this multiplication introduce a systematic error, or bias, into the resulting gradient estimate?
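A minimal code sketch of the reward-to-go weighting referenced above, assuming plain Python and an illustrative trajectory of per-step rewards (the function name reward_to_go and the sample numbers are hypothetical):

```python
# Illustrative sketch: reward-to-go weights for a REINFORCE-style update.
# For each timestep t, the weight is sum_{k=t}^{T} r_k rather than the
# full-episode return sum_{k=1}^{T} r_k.

def reward_to_go(rewards):
    """Return w with w[t] = sum(rewards[t:]) for every index t."""
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        weights[t] = running
    return weights

rewards = [1.0, 0.0, 2.0, 1.0]     # hypothetical per-step rewards r_1..r_T
print(sum(rewards))                # full return: 4.0, the same weight at every t
print(reward_to_go(rewards))       # [4.0, 3.0, 3.0, 1.0]

# In the update, ∇_θ log π_θ(a_t|s_t) is multiplied by reward_to_go(rewards)[t]
# instead of the full return, which leaves the expected gradient unchanged
# while lowering the variance of the estimate.
```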