1Cademy - An agent completes an episode with the following sequence of rewards: `r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10`. When updating the policy for the action taken at time step `t=2`, a baseline value of `b(s_2) = 5` is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at `t=2`?

Learn Before

Policy Gradient with Reward-to-Go and Baseline

Multiple Choice

An agent completes an episode with the following sequence of rewards: `r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10`. When updating the policy for the action taken at time step `t=2`, a baseline value of `b(s_2) = 5` is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at `t=2`?
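
One way to work through the arithmetic is sketched below. It assumes the common undiscounted form in which the weight on the log-probability gradient at step `t` is `sum_{k=t}^{T} r_k - b(s_t)`, so the reward-to-go at `t=2` includes `r_2` itself. The function name `advantage_weight` and the optional `gamma` parameter are hypothetical, introduced only for illustration. If your course material indexes rewards so that the reward following the action at step `t` is `r_{t+1}`, the sum would start one step later and the result would differ.

```python
def advantage_weight(rewards, t, baseline, gamma=1.0):
    """Reward-to-go from step t (1-indexed, inclusive) minus a baseline.

    Assumption: the reward-to-go at step t is sum_{k=t}^{T} gamma^(k-t) * r_k,
    with gamma = 1.0 (no discounting) unless stated otherwise.
    """
    future = rewards[t - 1:]  # rewards r_t, r_{t+1}, ..., r_T
    reward_to_go = sum(gamma ** k * r for k, r in enumerate(future))
    return reward_to_go - baseline


# Values taken from the question: r_1..r_4 and b(s_2) = 5.
rewards = [-1, -1, -1, 10]
weight = advantage_weight(rewards, t=2, baseline=5)
print(weight)  # (-1) + (-1) + 10 - 5
```

Under these assumptions the rewards from `t=1` are excluded because they were received before the action at `t=2` was taken, and the baseline is then subtracted from the remaining sum.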

Updated 2025-10-04
