Multiple Choice

An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
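
Answer: The one-step TD error, with the reward model's scalar r_t standing in for the environment reward, is

    δ_t = r_t + γ V(s_{t+1}) - V(s_t)

i.e., the reward plus the discounted value estimate of the next state, minus the value estimate of the current state. Because V(s_t) is the expected return from s_t, δ_t serves as a one-sample estimate of the advantage of taking a_t in s_t. A minimal Python sketch of the computation (the function name and the toy numbers are illustrative, not part of the question):

    def td_error(r_t, v_s, v_s_next, gamma):
        """One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
        return r_t + gamma * v_s_next - v_s

    # Toy numbers, purely illustrative: the reward model outputs r_t = 0.5,
    # and the value network estimates V(s_t) = 1.0 and V(s_{t+1}) = 1.2.
    delta = td_error(r_t=0.5, v_s=1.0, v_s_next=1.2, gamma=0.99)
    print(delta)  # 0.5 + 0.99 * 1.2 - 1.0 = 0.688 (up to float rounding)

The same δ_t typically drives both updates in an actor-critic loop: the critic regresses V(s_t) toward the target r_t + γ V(s_{t+1}), and the actor scales its policy-gradient step by δ_t as the advantage signal.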

