Learn Before
Calculating Advantage Estimate
An agent in a reinforcement learning system takes an action in a given state, resulting in a transition to a new state. A value network provides the following estimates for the states: the value of the current state is 2.5, and the value of the next state is 3.0. A separate reward model provides an immediate reward of 0.5 for this transition. Assuming a discount factor of 0.9, calculate the one-step temporal difference error used as an estimate for the advantage of the action taken. Show your calculation.
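The given values plug directly into the one-step TD formula δ_t = r_t + γ·V(s_{t+1}) − V(s_t). A minimal sketch of that calculation in Python (variable names are illustrative, not from the original card):

```python
# One-step TD error used as an advantage estimate:
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)

reward = 0.5          # r_t from the reward model
gamma = 0.9           # discount factor
value_current = 2.5   # V(s_t) from the value network
value_next = 3.0      # V(s_{t+1}) from the value network

td_error = reward + gamma * value_next - value_current
print(td_error)  # 0.5 + 0.9 * 3.0 - 2.5 = 0.7
```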
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
Calculating Advantage Estimate
Debugging Advantage Estimation in A2C