Rationale for Reward Decomposition
In a common variance reduction technique for training a decision-making agent, the gradient update for an action taken at a specific timestep t is calculated. As an intermediate step, the total sum of rewards for the entire sequence, (∑_{k=1}^{T} r_k), is algebraically rewritten as the sum of two separate components: (∑_{k=1}^{t-1} r_k) + (∑_{k=t}^{T} r_k). Explain the fundamental principle this mathematical step is designed to leverage and why this specific decomposition is a necessary prerequisite for applying that principle.
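The key consequence of this split can be checked directly: the term pairing the score function at timestep t with the pre-t reward sum has zero expectation, which is why the decomposition matters. A minimal sketch below verifies this by exact enumeration in a toy setting; the two-action softmax policy, logits, and reward values are all illustrative assumptions, not from the source.

```python
import itertools
import math

# Hypothetical two-action softmax policy; logits and rewards are illustrative.
theta = [0.3, -0.7]
z = sum(math.exp(x) for x in theta)
p = [math.exp(x) / z for x in theta]
reward = [1.0, 0.2]          # toy setup: reward depends only on the action taken

def score(a):
    # d/d(theta[0]) of log pi(a) for a softmax: indicator minus probability.
    return (1.0 if a == 0 else 0.0) - p[0]

T, t = 3, 2                  # trajectory length 3; gradient term for step t=2

# Exact expectation over all 2^T action sequences.
past_term = 0.0              # E[ score(a_t) * sum_{k=1}^{t-1} r_k ]
full_term = 0.0              # E[ score(a_t) * sum_{k=1}^{T} r_k ]
for traj in itertools.product(range(2), repeat=T):
    prob = math.prod(p[a] for a in traj)
    past = sum(reward[a] for a in traj[:t-1])
    total = sum(reward[a] for a in traj)
    past_term += prob * score(traj[t-1]) * past
    full_term += prob * score(traj[t-1]) * total

print(f"past-reward term: {past_term:.12f}")   # ~0: past rewards drop out
print(f"full-return term: {full_term:.12f}")   # the surviving part comes from k >= t
```

Because the action at timestep t is sampled independently of rewards earned before t, the past-reward component multiplies a zero-mean score function and vanishes in expectation; only the k ≥ t component survives.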
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right)

Here, t is a specific timestep within the trajectory of length T, \pi_\theta(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t (\nabla_\theta \log \pi_\theta(a_t|s_t)) and the two components of the reward sum?

In the context of improving a policy gradient estimator, the total reward for a trajectory, \sum_{k=1}^{T} r_k, is often rewritten inside the gradient calculation for a specific timestep t as \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k. This specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.
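The claim above can be probed numerically. The sketch below samples trajectories under a hypothetical two-action softmax policy (all names, logits, and reward values are assumptions for illustration) and compares the estimator that multiplies the score by the full return against the one that keeps only the reward-to-go component.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical two-action softmax policy; logits and rewards are illustrative.
theta = [0.3, -0.7]
z = sum(math.exp(x) for x in theta)
p = [math.exp(x) / z for x in theta]
reward = [1.0, 0.2]          # toy setup: reward depends only on the action taken

def score(a):
    # d/d(theta[0]) of log pi(a) for a softmax: indicator minus probability.
    return (1.0 if a == 0 else 0.0) - p[0]

T, t = 3, 2                  # trajectory length 3; gradient term for step t=2
N = 100_000

g_full, g_togo = [], []
for _ in range(N):
    traj = random.choices([0, 1], weights=p, k=T)
    total = sum(reward[a] for a in traj)        # sum_{k=1}^{T} r_k
    togo = sum(reward[a] for a in traj[t-1:])   # sum_{k=t}^{T} r_k
    s = score(traj[t-1])
    g_full.append(s * total)
    g_togo.append(s * togo)

mean_full = statistics.fmean(g_full)
mean_togo = statistics.fmean(g_togo)
print(f"means:     {mean_full:.4f} vs {mean_togo:.4f}")
print(f"variances: {statistics.pvariance(g_full):.4f} vs "
      f"{statistics.pvariance(g_togo):.4f}")
```

In this toy run the two sample means agree closely while the reward-to-go estimator shows a smaller variance, consistent with the past-reward component contributing only zero-mean noise.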