State-Value Function (V) Formula

The state-value function, denoted $V(s)$, quantifies the expected discounted return (the sum of future rewards, each weighted by the discount factor) an agent will receive if it starts in a specific state $s$ and strictly follows a given policy $\pi$ thereafter. Mathematically, it is expressed as an expectation over all possible state-action trajectories:

$$V(s) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; s_0 = s, \pi \Big]$$

This can also be expanded to explicitly show the individually discounted future rewards:

$$V(s) = \mathbb{E}\Big[ r(s_0, a_0, s_1) + \gamma\, r(s_1, a_1, s_2) + \gamma^2\, r(s_2, a_2, s_3) + \cdots \;\Big|\; s_0 = s, \pi \Big]$$

$$V(s) = \mathbb{E}\Big[ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \;\Big|\; s_0 = s, \pi \Big]$$
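For a finite reward sequence, the discounted sum inside the expectation is straightforward to compute directly. A minimal sketch (the reward values and $\gamma$ below are illustrative assumptions, not from the text):

```python
# Truncated discounted return: sum of gamma^t * r_t over a finite
# reward sequence (the rewards and gamma here are made-up examples).
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]          # r_0, r_1, r_2
g = discounted_return(rewards, gamma=0.9)
# 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```

Note how a reward received two steps in the future is weighted by $\gamma^2 = 0.81$, so later rewards contribute less to the return.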

In these formulas, $\gamma$ ($0 \le \gamma \le 1$) is the discount factor that controls the weight of future rewards, $s_0 = s$ specifies the initial starting state, and $r_t$ is the reward received at time step $t$.
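Since $V(s)$ is an expectation over trajectories, it can be estimated by Monte Carlo: run many rollouts from $s$ under the policy, compute each rollout's discounted return, and average. The sketch below uses a made-up two-state toy MDP (the dynamics and rewards are assumptions for illustration only), with the infinite sum truncated at a finite horizon:

```python
import random

# Monte Carlo estimate of V(s): average the discounted return over many
# rollouts starting in s and following a fixed policy. The two-state MDP
# below is a hypothetical toy example, not one from the text.
def step(s, rng):
    """Environment dynamics and fixed policy rolled into one transition."""
    if s == 0:
        # From state 0: 50% chance of moving to state 1 with reward 1.0,
        # otherwise stay in state 0 with reward 0.0.
        return (1, 1.0) if rng.random() < 0.5 else (0, 0.0)
    # From state 1: deterministically return to state 0 with reward 0.5.
    return (0, 0.5)

def mc_state_value(s0, gamma=0.9, episodes=20000, horizon=100, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite sum
            s, r = step(s, rng)
            ret += discount * r           # accumulate gamma^t * r_t
            discount *= gamma
        total += ret
    return total / episodes

v0 = mc_state_value(0)
# Solving the Bellman equations for this particular chain by hand gives
# V(0) = V(1) = 5.0, so the estimate should land close to 5.0.
```

With $\gamma = 0.9$ the tail of the sum beyond 100 steps is negligible (bounded by $\gamma^{100}$ times the largest per-step reward), so the truncation introduces essentially no bias here.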


Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Computing Sciences