Concept

State-Value Function as a Baseline

A common and effective strategy for setting the baseline, bb, in policy gradient methods is to use the state-value function, V(st)V(s_t). This function represents the expected cumulative future reward from a given state sts_t, formally defined as V(st)=E[rt+rt+1++rT]V(s_t) = E[r_t + r_{t+1} + \dots + r_T]. Using the value of the current state as a baseline helps to center the rewards and can significantly reduce the variance of the gradient estimate.

0

1

Updated 2025-10-07

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences