
Simplifying Advantage Function Calculation in A2C

At first glance, the Advantage Actor-Critic (A2C) model may seem challenging to develop because the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ appears to require two separate sub-models: one for the action-value function $Q$ and one for the state-value function $V$. However, by expressing the $Q$-value as the immediate reward plus the value of the next state, $Q(s_t, a_t) = r_t + V(s_{t+1})$, the equation can be rewritten as $A(s_t, a_t) = r_t + V(s_{t+1}) - V(s_t)$. Introducing the discount factor $\gamma$ generalizes this to the temporal difference (TD) error: $A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. This means A2C only needs to train a single critic network for the value function $V(s_t)$ to compute the advantage.
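The TD-error form of the advantage can be sketched in a few lines of NumPy. This is a minimal illustration, not a full A2C implementation: the function name, the `dones` mask (which zeroes the bootstrap term at episode ends), and the choice $\gamma = 0.99$ are assumptions for the example; in practice the `values` and `next_values` arrays would come from a single critic network.

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """Compute A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    values / next_values: critic estimates V(s_t) and V(s_{t+1}).
    dones: 1.0 where the episode ended at step t, so the bootstrap
    term gamma * V(s_{t+1}) is dropped for terminal transitions.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_values * not_done - values

# Single transition with r_t = 1.0, V(s_t) = 0.5, V(s_{t+1}) = 0.6:
# advantage = 1.0 + 0.99 * 0.6 - 0.5 = 1.094
adv = td_advantage([1.0], [0.5], [0.6], [0.0], gamma=0.99)
```

Note that only one value function appears in the computation: both $V(s_t)$ and $V(s_{t+1})$ are estimates from the same critic, evaluated on consecutive states.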

Updated 2026-05-01


Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences