
Simplifying Advantage Function Calculation in A2C

At first glance, the Advantage Actor-Critic (A2C) model may seem challenging to develop because the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ appears to require two separate sub-models: one for the action-value function $Q$ and one for the state-value function $V$. However, by expressing the $Q$-value as the immediate reward plus the value of the next state, $Q(s_t, a_t) = r_t + V(s_{t+1})$, the equation can be rewritten as $A(s_t, a_t) = r_t + V(s_{t+1}) - V(s_t)$. Introducing the discount factor $\gamma$ generalizes this to the temporal difference (TD) error: $A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. This means A2C only needs to train a single critic network for the value function $V(s_t)$ to compute the advantage.
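The TD-error form of the advantage can be sketched in a few lines of NumPy. This is a minimal illustration, not a full A2C implementation: the function name, the `dones` mask (which zeroes the bootstrap term at episode ends), and the choice $\gamma = 0.99$ are assumptions for the example; in practice the `values` and `next_values` arrays would come from a single critic network.

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """Compute A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    values / next_values: critic estimates V(s_t) and V(s_{t+1}).
    dones: 1.0 where the episode ended at step t, so the bootstrap
    term gamma * V(s_{t+1}) is dropped for terminal transitions.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_values * not_done - values

# Single transition with r_t = 1.0, V(s_t) = 0.5, V(s_{t+1}) = 0.6:
# advantage = 1.0 + 0.99 * 0.6 - 0.5 = 1.094
adv = td_advantage([1.0], [0.5], [0.6], [0.0], gamma=0.99)
```

Note that only one value function appears in the computation: both $V(s_t)$ and $V(s_{t+1})$ are estimates from the same critic, evaluated on consecutive states.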

Updated 2026-05-01


Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences