Multiple Choice

In a reinforcement learning setting, the policy π_θ(a_t|s_t) is updated via a gradient computation. For a specific timestep t in a trajectory of length T, the term contributing to the update can be written as follows, separating rewards received before time t from those received at or after time t:

log π_θ(a_t|s_t) * ( (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k) )

When computing the gradient of this expression with respect to the policy parameters θ, how does the ∑_{k=1}^{t-1} r_k term (the sum of past rewards) influence the gradient associated with the action at time t?

0

1
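To explore this question numerically, here is a minimal sketch (the softmax policy, parameter values, and sample size are illustrative assumptions, not part of the original question). Any reward earned before time t is a quantity independent of a_t, so its contribution to the gradient is that constant times the Monte Carlo average of ∇_θ log π_θ(a_t|s_t) under the policy, which the sketch estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])  # illustrative logits for a 3-action policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    # For a softmax policy: ∇_θ log π_θ(a) = one_hot(a) - softmax(θ)
    g = -softmax(theta)
    g[a] += 1.0
    return g

# Sample actions from π_θ and average the score ∇_θ log π_θ(a_t|s_t).
p = softmax(theta)
actions = rng.choice(len(theta), size=200_000, p=p)
avg_score = np.mean([grad_log_pi(a, theta) for a in actions], axis=0)
print(avg_score)
```

Multiplying `avg_score` by the (constant) sum of past rewards gives the past-reward term's contribution to the expected gradient at time t; inspecting the printed estimate answers the question above.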

Updated 2025-10-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science