Formula

Temporal Difference (TD) Error as an Advantage Function Estimator

The temporal difference (TD) error is a common estimator for the advantage function $A(s_t, a_t)$. The estimate, denoted $\hat{A}(s_t, a_t)$, is the difference between the immediate reward plus the discounted value of the next state and the value of the current state:

$$\hat{A}(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This formulation, also known as the one-step advantage estimate, is a foundational component of many actor-critic algorithms. By approximating the action-value function $Q(s_t, a_t)$ with $r_t + \gamma V(s_{t+1})$, it allows the advantage to be computed efficiently using a single critic network $V(s_t)$.
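The one-step estimate above can be sketched in a few lines of code. The rewards, value estimates, and discount factor below are illustrative placeholders, not values from the text; in practice the values would come from a learned critic network.

```python
# A minimal sketch of one-step TD-error advantage estimation:
#   A_hat(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)

def td_error_advantages(rewards, values, gamma=0.99):
    """Compute one-step advantage estimates for a trajectory.

    rewards: list of r_t for t = 0 .. T-1
    values:  list of V(s_t) for t = 0 .. T (one extra entry for the
             final state; use 0.0 if the episode terminates there)
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]

# Illustrative 3-step trajectory ending in a terminal state.
rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.6, 0.4, 0.0]  # V(s_3) = 0 at termination
advantages = td_error_advantages(rewards, values)
```

Note that the critic is queried only for state values, never state-action values, which is exactly the efficiency gain described above.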

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
