Learn Before
An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
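The quantity being asked for is the standard one-step TD error, δ_t = r_t + γ·V(s_{t+1}) − V(s_t), which serves as a single-sample estimate of the advantage A(s_t, a_t). A minimal sketch (function and variable names here are illustrative, not from the card):

```python
def td_error(r_t: float, v_s: float, v_s_next: float, gamma: float) -> float:
    """One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    r_t comes from the pre-trained reward model; v_s and v_s_next are
    the value network's estimates V(s_t) and V(s_{t+1}).
    """
    return r_t + gamma * v_s_next - v_s

# Worked example: r_t = 1.0, V(s_t) = 2.0, V(s_{t+1}) = 2.5, gamma = 0.9
delta = td_error(1.0, 2.0, 2.5, 0.9)
# 1.0 + 0.9 * 2.5 - 2.0 = 1.25, so the action did better than expected
```

A positive δ_t means the transition returned more value than the critic predicted, so the actor's probability of a_t is increased; a negative δ_t decreases it.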
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Advantage Estimate
Debugging Advantage Estimation in A2C