Formula

Advantage Function Estimation using Reward-to-Go

The advantage at time step $t$, denoted $A(s_t, a_t)$, quantifies the relative benefit of taking a specific action compared to the expected value of following the policy from state $s_t$ onward. It can be estimated by subtracting a baseline from the actual return. Using the state-value function $V(s_t)$ as the baseline, the formula is:

$$A(s_t, a_t) = \sum_{k=t}^{T} r_k - V(s_t)$$

In this equation, the term $\sum_{k=t}^{T} r_k$ represents the actual return received from time step $t$ onward (the reward-to-go), while $V(s_t)$ represents the expected return from state $s_t$.
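
As a rough illustration of the reward-to-go estimate, the following Python sketch computes the return from each time step to the end of an episode and subtracts a baseline. The function names `reward_to_go` and `advantages`, and the example numbers, are hypothetical and chosen only for illustration. The sketch assumes a single finite episode, undiscounted rewards, and baseline values $V(s_t)$ already supplied by some value estimator.

```python
from typing import List, Sequence


def reward_to_go(rewards: Sequence[float]) -> List[float]:
    """Return sum_{k=t}^{T} r_k for every time step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum accumulated for later steps.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


def advantages(rewards: Sequence[float], values: Sequence[float]) -> List[float]:
    """Estimate A(s_t, a_t) as reward-to-go minus the baseline V(s_t).

    `values` is assumed to hold one V(s_t) estimate per time step.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    return [g - v for g, v in zip(reward_to_go(rewards), values)]


# Illustrative numbers only: a three-step episode with made-up value estimates.
rewards = [1.0, 0.0, 2.0]
values = [2.5, 1.5, 1.0]
print(reward_to_go(rewards))        # [3.0, 2.0, 2.0]
print(advantages(rewards, values))  # [0.5, 0.5, 1.0]
```

A positive entry means the return actually received from that step exceeded what the baseline expected from that state, and a negative entry means it fell short.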

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences