Learn Before
Debugging Advantage Estimation in A2C
Analyze the following scenario and identify the most likely reason for the observed training issue. Explain your reasoning by first calculating the advantage estimate and then interpreting its components.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Calculating Advantage Estimate
An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
Debugging Advantage Estimation in A2C
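
As a point of reference for the cards above, below is a minimal sketch of the standard one-step TD-error advantage estimate, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with the reward supplied by a reward model rather than the environment, as in the scenario. The function name td_error_advantage and the names value_net and reward_model are illustrative assumptions, not part of the original cards.

import torch

def td_error_advantage(value_net, reward_model, s_t, a_t, s_next, gamma=0.99, done=False):
    """Estimate the advantage of taking a_t in s_t via the one-step TD error.

    The scalar reward comes from a pre-trained reward model instead of the
    environment, matching the scenario described in the related question.
    """
    with torch.no_grad():
        r_t = reward_model(s_t, a_t, s_next)               # scalar reward for the transition
        v_t = value_net(s_t)                               # V(s_t)
        v_next = value_net(s_next) * (1.0 - float(done))   # V(s_{t+1}); zero at episode end
    # One-step TD error used as the advantage estimate:
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    return r_t + gamma * v_next - v_t

A positive delta_t indicates the transition returned more than the value network predicted for s_t (the action looks better than average), while a negative delta_t indicates the opposite; systematically extreme or constant values of either component are a common starting point when debugging advantage estimation in A2C.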