In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\nabla_\theta \log \pi_\theta(a_t|s_t) \left( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k - b \right)
Here, t is a specific timestep within the trajectory of length T, \pi_\theta(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t ( \nabla_\theta \log \pi_\theta(a_t|s_t) ) and the two components of the reward sum?
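For concreteness, a minimal NumPy sketch of this per-trajectory term, assuming toy stand-in arrays for the score-function gradients and rewards (names such as grad_log_pi are illustrative, not from any particular library):

```python
import numpy as np

# Toy stand-ins for one trajectory of length T (all names are illustrative):
# grad_log_pi[t] plays the role of \nabla_\theta \log \pi_\theta(a_t|s_t),
# flattened into a single parameter vector of dimension d.
T, d = 5, 3
rng = np.random.default_rng(0)
grad_log_pi = rng.normal(size=(T, d))  # score-function gradients per timestep
r = rng.normal(size=T)                 # rewards r_1, ..., r_T (0-indexed here)
b = 0.5                                # the constant baseline

# Per-timestep weight: total trajectory reward minus the baseline.
total = r.sum()
grad_full = sum(grad_log_pi[t] * (total - b) for t in range(T))

# The same weight with the reward sum split at timestep t into
# past rewards (k < t) and reward-to-go (k >= t).
grad_split = sum(
    grad_log_pi[t] * (r[:t].sum() + r[t:].sum() - b) for t in range(T)
)

# For any fixed trajectory the two forms coincide exactly.
assert np.allclose(grad_full, grad_split)
```

For a fixed trajectory the split form is algebraically identical to the total-return form; the question above concerns how the two components behave in expectation relative to the policy term.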
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
In the context of improving a policy gradient estimator, the total reward for a trajectory, ( \sum_{k=1}^{T} r_k ), is often rewritten inside the gradient calculation for a specific timestep t as ( \sum_{k=1}^{t-1} r_k + \sum_{k=t}^{T} r_k ). This specific algebraic decomposition, by itself, alters the expected value of the gradient estimate.

Rationale for Reward Decomposition
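A minimal sketch of the standard argument, assuming a toy softmax policy over three actions (theta and all other names are illustrative): the expected score function at any state is the zero vector, so a weight that does not depend on the sampled action a_t, such as the past-reward sum or the constant baseline b, contributes nothing to the expected gradient.

```python
import numpy as np

# Toy softmax policy over three actions; theta is an illustrative parameter.
theta = np.array([0.2, -1.0, 0.7])

def pi(theta):
    z = np.exp(theta - theta.max())  # numerically stable softmax
    return z / z.sum()

p = pi(theta)

# For a softmax policy the score function has the closed form
# \nabla_\theta \log \pi(a) = onehot(a) - pi(theta); row a below is that gradient.
score = np.eye(3) - p

# Expected score under the policy is the zero vector:
# \sum_a \pi(a) \nabla_\theta \log \pi(a) = \nabla_\theta \sum_a \pi(a) = \nabla_\theta 1 = 0.
expected_score = p @ score
assert np.allclose(expected_score, 0.0)

# Any weight c that is independent of the sampled action (past rewards,
# or the constant baseline b) therefore contributes c * 0 = 0 in expectation.
```

Dropping the past-reward component therefore leaves the estimator's expectation unchanged while typically reducing its variance, which is the usual motivation for the reward-to-go form with a baseline.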