Learn Before
Advantage Estimation for A2C with a Reward Model
In the context of the Advantage Actor-Critic (A2C) algorithm, the advantage function that appears in the utility function is typically estimated using the Temporal Difference (TD) error, calculated as δ_t = r_t + γ·V(s_{t+1}) − V(s_t). The value function V(s) used in this estimation is, in turn, trained using rewards produced by a reward model.
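A minimal sketch of this estimate in Python, assuming a hypothetical reward_model(s_t, a_t) that returns the scalar reward and a hypothetical value_net(s) that returns V(s); these names and signatures are illustrative assumptions, not from the source:

```python
import torch

def td_error_advantage(reward_model, value_net, s_t, a_t, s_next, gamma=0.99, done=False):
    """Estimate A(s_t, a_t) with the one-step TD error:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():
        r_t = reward_model(s_t, a_t)           # scalar reward from the learned reward model
        v_t = value_net(s_t)                   # V(s_t)
        v_next = value_net(s_next)             # V(s_{t+1})
        if done:
            v_next = torch.zeros_like(v_next)  # do not bootstrap past a terminal state
    return r_t + gamma * v_next - v_t          # TD error used as the advantage estimate
```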
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A2C Actor Loss Function
Application of A2C in RLHF for LLM Alignment
Advantage Estimation for A2C with a Reward Model
In an actor-critic reinforcement learning algorithm, the policy π_θ is updated to maximize the objective function J(θ) = E[log π_θ(a|s) · A(s, a)], where A(s, a) is the advantage of taking action a in state s. If, for a specific state-action pair (s, a), the calculated advantage A(s, a) is a large positive value, what is the intended immediate effect on the policy after a gradient-based update step?
Analysis of a Policy Gradient Update
In an actor-critic reinforcement learning framework, the actor's objective is to adjust its policy parameters, θ, to maximize the utility function J(θ). Consider the following statement: 'If the advantage function A(s, a) for a specific action a is negative, the optimization process will adjust the policy parameters θ to decrease the probability of selecting that action in state s in the future.'
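Both questions above turn on the same mechanism: the actor objective weights log π_θ(a|s) by the advantage, so a gradient step raises the probability of an action with positive advantage and lowers it for one with negative advantage. A minimal sketch of that update direction, assuming a hypothetical policy_net that maps a single state tensor to action logits (the name and interface are assumptions):

```python
import torch

def actor_loss(policy_net, s, a, advantage):
    """A2C actor loss for a single (s, a) pair:
    L(theta) = -log pi_theta(a|s) * A(s, a).
    Gradient descent on this loss increases pi_theta(a|s) when the
    advantage is positive and decreases it when the advantage is negative."""
    logits = policy_net(s)                           # unnormalized action scores
    log_prob = torch.log_softmax(logits, dim=-1)[a]  # log pi_theta(a|s)
    return -log_prob * float(advantage)              # advantage is treated as a fixed scalar
```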
Learn After
Calculating Advantage Estimate
An actor-critic agent is being trained to perform a task where explicit rewards are not available from the environment. Instead, a separate, pre-trained reward model provides a scalar reward r_t for each transition (s_t, a_t, s_{t+1}). The agent also maintains a value network that estimates the expected future return from any given state, V(s). Given a discount factor γ, which of the following correctly represents the one-step temporal difference (TD) error used to estimate the advantage of taking action a_t in state s_t?
Debugging Advantage Estimation in A2C