Learn Before
An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
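The quantity being asked for is the standard one-step TD error, δ_t = r_t + γ·V(s_{t+1}) − V(s_t), which serves as a single-sample estimate of the advantage A(s_t, a_t). A minimal sketch (function and variable names here are illustrative, not from the card):

```python
def td_error(r_t: float, v_s: float, v_s_next: float, gamma: float) -> float:
    """One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    r_t comes from the pre-trained reward model; v_s and v_s_next are
    the value network's estimates V(s_t) and V(s_{t+1}).
    """
    return r_t + gamma * v_s_next - v_s

# Worked example: r_t = 1.0, V(s_t) = 2.0, V(s_{t+1}) = 2.5, gamma = 0.9
delta = td_error(1.0, 2.0, 2.5, 0.9)
# 1.0 + 0.9 * 2.5 - 2.0 = 1.25, so the action did better than expected
```

A positive δ_t means the transition returned more value than the critic predicted, so the actor's probability of a_t is increased; a negative δ_t decreases it.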
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Advantage Estimate
Debugging Advantage Estimation in A2C