Consider the calculation of the policy gradient with respect to the parameters θ for an action a_t taken at time t. A proposed update rule multiplies the term ∇_θ log π_θ(a_t|s_t) by the sum of all rewards in the trajectory (∑_{k=1}^{T} r_k). The rewards received before time t (∑_{k=1}^{t-1} r_k) do not depend on a_t, so their expected contribution to the gradient is zero: the score function ∇_θ log π_θ(a_t|s_t) has zero mean under the policy, and multiplying it by a constant keeps that mean at zero. Including past rewards therefore does not bias the gradient estimate, but it does inflate its variance; using only the reward-to-go ∑_{k=t}^{T} r_k yields the same expected gradient with lower variance.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In a reinforcement learning setting, the update for a policy π_θ(a_t|s_t) depends on a gradient calculation. For a specific timestep t within a sequence of actions, the term influencing the update can be structured as follows, separating rewards that occurred before time t from those that occurred at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

Policy Gradient Optimization
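The effect of the past-reward term can be checked numerically. The sketch below is purely illustrative (a two-action softmax bandit with an artificial "past reward" constant c; none of these names or numbers come from the note): it estimates the policy gradient once with the full return, c + r, and once with the reward-to-go, r alone, and compares the two estimators.

```python
# Monte Carlo sketch: a constant "past reward" c multiplying the score
# ∇_θ log π_θ(a) leaves the expected gradient unchanged but inflates variance.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                        # softmax policy over two actions
pi = np.exp(theta) / np.exp(theta).sum()
r = np.array([1.0, 0.0])                   # reward for each action
c = 5.0                                    # "past reward" sum, independent of a_t

n = 200_000
a = rng.choice(2, size=n, p=pi)            # sample actions from the policy
score = np.eye(2)[a] - pi                  # ∇_θ log π_θ(a) for a softmax policy

g_full = score * (c + r[a])[:, None]       # score times full return
g_togo = score * r[a][:, None]             # score times reward-to-go only

print(g_full.mean(axis=0), g_togo.mean(axis=0))  # means nearly identical
print(g_full.var(axis=0), g_togo.var(axis=0))    # full-return variance is larger
```

Both estimators converge to the same expected gradient; the full-return version simply pays for the constant c with extra variance, which is why the reward-to-go form is preferred in practice.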