Multiple Choice

An agent completes an episode with the following sequence of rewards: r_1 = -1, r_2 = -1, r_3 = -1, r_4 = +10. When updating the policy for the action taken at time step t=2, a baseline value of b(s_2) = 5 is used. According to the policy gradient method that incorporates both reward-to-go and a baseline, what is the numerical value of the term that multiplies the gradient of the log-probability of the action at t=2?

0

1
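The arithmetic behind the question can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes no discounting (the question states no discount factor) and that the reward-to-go at step t includes r_t itself. The helper name `reward_to_go` and the variable names are hypothetical, chosen only for this example.

```python
def reward_to_go(rewards, t):
    """Undiscounted sum of rewards from 1-indexed step t to the end of the episode.

    Assumption: no discount factor, and r_t itself is included in the sum.
    """
    return sum(rewards[t - 1:])


# Episode rewards r_1..r_4 and the baseline b(s_2), as given in the question.
rewards = [-1, -1, -1, 10]
baseline_s2 = 5

# Reward-to-go from t=2: r_2 + r_3 + r_4 = -1 + -1 + 10 = 8
g2 = reward_to_go(rewards, t=2)

# Term that multiplies grad log pi(a_2 | s_2): reward-to-go minus baseline.
weight = g2 - baseline_s2

print(g2, weight)  # prints "8 3"
```

Under these assumptions, r_1 does not enter the term at all, because reward-to-go counts only rewards received from step t onward; the baseline then shifts the result without depending on the action taken.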

Updated 2025-10-04

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science