Value Network Loss Function in A2C

In the Advantage Actor-Critic (A2C) algorithm, the loss function for the value network (the critic), parameterized by $\omega$, is defined as the mean squared temporal difference (TD) error over a batch of experiences:

$$\mathcal{L}_v(\omega) = \frac{1}{M} \sum \left( r_t + \gamma V_\omega(s_{t+1}) - V_\omega(s_t) \right)^2$$

Here, $M$ is the number of training samples, and the sum runs over those samples (for example, for a sequence of $T$ tokens, we can set $M = T$); $\gamma$ is the discount factor. The term $r_t + \gamma V_\omega(s_{t+1})$ is the TD target, a one-step estimate of the return, and $V_\omega(s_t)$ is the predicted state value. Minimizing this loss trains the critic to predict the expected return accurately.
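
As a rough illustration of the formula above, the following sketch computes the mean squared TD error for a small batch in plain Python. The function name `critic_td_loss`, its argument names, and the toy numbers are assumptions made for this example and do not come from the source; the critic's outputs are passed in as precomputed lists rather than produced by a real network.

```python
def critic_td_loss(rewards, values, next_values, gamma=0.99):
    """Mean squared TD error over a batch of M transitions.

    rewards[i]     -> r_t
    values[i]      -> V(s_t), the critic's prediction for the current state
    next_values[i] -> V(s_{t+1}), the critic's prediction for the next state
    All names here are hypothetical, chosen only for this sketch.
    """
    if not (len(rewards) == len(values) == len(next_values)):
        raise ValueError("all inputs must have the same length M")
    m = len(rewards)
    if m == 0:
        raise ValueError("batch must contain at least one transition")

    total = 0.0
    for r, v, v_next in zip(rewards, values, next_values):
        td_target = r + gamma * v_next   # r_t + gamma * V(s_{t+1})
        td_error = td_target - v         # TD target minus V(s_t)
        total += td_error ** 2
    return total / m                     # average over the M samples


# Toy batch of M = 3 transitions (made-up numbers).
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]
next_values = [0.6, 0.8, 0.0]  # last next-state value set to 0, as for a terminal state
loss = critic_td_loss(rewards, values, next_values, gamma=0.9)
print(loss)
```

With these numbers the three TD errors are about 0.04, 0.12, and 0.2, so the loss is roughly 0.056 / 3. When the same computation is written in an automatic differentiation framework, the TD target is commonly treated as a constant so that gradients flow only through $V_\omega(s_t)$; that is a frequent implementation choice rather than something stated in the text above.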

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences