Learn Before
An agent completes an episode with the following sequence of rewards: r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t=2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t=2?
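The arithmetic behind this question can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes no discounting (a discount factor of 1) and that the reward-to-go at step t sums the rewards from r_t through the end of the episode. The helper names `reward_to_go` and `advantage_weight` are hypothetical and chosen only for this example.

```python
def reward_to_go(rewards, t):
    """Sum of rewards from time step t (1-indexed) to the end of the episode.

    Assumes an undiscounted return, i.e. a discount factor of 1.
    """
    return sum(rewards[t - 1:])


def advantage_weight(rewards, t, baseline):
    """Term multiplying grad log pi(a_t | s_t): reward-to-go minus baseline."""
    return reward_to_go(rewards, t) - baseline


# Episode from the question: r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10
rewards = [-1, -1, -1, 10]

# Reward-to-go at t=2 is r_2 + r_3 + r_4 = 8; subtracting b(s_2) = 5 gives 3
weight = advantage_weight(rewards, t=2, baseline=5)
print(weight)  # 3
```

Under these assumptions, r_1 does not enter the weight at all, because the action at t=2 cannot influence a reward that was already received. The baseline only shifts the weight and does not depend on the action taken.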
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Policy Update Mechanisms
Stabilizing Policy Gradient Training