Learn Before
Concept

Advantage Estimation for A2C with a Reward Model

In the context of the Advantage Actor-Critic (A2C) algorithm, the advantage function A(s_t, a_t) that appears in the utility function is typically estimated using the Temporal Difference (TD) error, calculated as r_t + γ V(s_{t+1}) − V(s_t). The value function V(s_t) used in this estimation is, in turn, trained with a reward model.
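The TD-error advantage above can be sketched as a short function. This is a minimal illustration, not the platform's implementation: the rewards are assumed to come from a reward model and the value estimates from the critic, and all names here (td_advantage, the sample numbers) are hypothetical.

```python
def td_advantage(rewards, values, next_values, gamma=0.99):
    """TD-error estimate of the advantage for each step:
    A(s_t, a_t) ≈ r_t + γ V(s_{t+1}) − V(s_t)."""
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]

# Hypothetical rollout: rewards from a reward model, values from the critic.
rewards = [0.0, 0.0, 1.0]       # reward-model scores r_t
values = [0.5, 0.6, 0.7]        # V(s_t) from the critic
next_values = [0.6, 0.7, 0.0]   # V(s_{t+1}); terminal state has value 0
advantages = td_advantage(rewards, values, next_values, gamma=0.9)
```

A positive entry means the action did better than the critic's baseline expected, so the actor's update increases that action's probability.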


Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences