Learn Before
State-Value Function (V) Formula
The state-value function, denoted as $V^\pi(s)$, quantifies the expected discounted return (the sum of accumulated rewards) an agent will receive if it starts in a specific state and strictly follows a given policy thereafter. Mathematically, it is expressed as the expectation over all possible state-action trajectories:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_t \;\middle|\; S_0 = s\right]$$

This can also be expanded to explicitly show the individually discounted future rewards:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[R_0 + \gamma R_1 + \gamma^2 R_2 + \gamma^3 R_3 + \cdots \;\middle|\; S_0 = s\right]$$

In this formula, $\gamma$ ($0 \le \gamma \le 1$) is the discount factor that controls the weight of future rewards, $s$ specifies the initial starting state, and $R_t$ is the reward at time step $t$.
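Because $V^\pi(s)$ is an expectation over trajectories, it can be approximated by averaging many sampled discounted returns. Below is a minimal Monte Carlo sketch of that idea; the toy two-state environment, the 70/30 policy, and all names in it are illustrative assumptions, not part of the course material.

```python
import random

# A minimal Monte Carlo sketch of estimating V^pi(s): average many sampled
# discounted returns. The toy dynamics, the 70/30 policy, and every name
# below are illustrative assumptions, not course material.

GAMMA = 0.9     # discount factor (0 <= gamma < 1)
HORIZON = 200   # truncate the infinite sum; gamma^200 is negligible

def step(state, action):
    """Assumed toy dynamics: returns (next_state, reward)."""
    return ("s1", 1.0) if action == "A" else ("s0", 0.0)

def policy(state):
    """Assumed stochastic policy: action A with probability 0.7."""
    return "A" if random.random() < 0.7 else "B"

def sample_return(state):
    """One sampled discounted return G = sum_t gamma^t * R_t from `state`."""
    g, discount = 0.0, 1.0
    for _ in range(HORIZON):
        state, reward = step(state, policy(state))
        g += discount * reward
        discount *= GAMMA
    return g

def estimate_v(state, n_episodes=10_000):
    """V^pi(state) is the expectation of G; approximate it by averaging."""
    return sum(sample_return(state) for _ in range(n_episodes)) / n_episodes

print(f"Estimated V(s0) ≈ {estimate_v('s0'):.3f}")
```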

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Bellman Equation
An agent is in a state s and must choose between two actions: A and B. According to the agent's current policy, it chooses action A with a 70% probability and action B with a 30% probability. The expected total future reward for taking action A from state s is +20. The expected total future reward for taking action B from state s is -10. Based on this information, which of the following statements correctly describes the relationship between the value of being in state s and the values of taking each action? (See the arithmetic sketch after this list.)
An agent is learning to navigate a complex environment. Match each of the following questions the agent might have with the type of value function that would most directly provide the answer.
RLHF Component Interaction during Token Generation
Action-Value Function Definition
Drone Navigation Decision Analysis
Advantage Function in Terms of Q-values and V-values
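The first related question above rests on the standard identity that a state's value is the policy-weighted average of its action values, $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$. Here is a quick sketch of that arithmetic with the stated numbers; the variable names are illustrative assumptions.

```python
# State value as the policy-weighted average of action values:
# V(s) = sum_a pi(a|s) * Q(s, a)   (numbers taken from the question above)
policy_probs = {"A": 0.7, "B": 0.3}       # pi(a|s)
action_values = {"A": 20.0, "B": -10.0}   # Q(s, a)

v_s = sum(policy_probs[a] * action_values[a] for a in policy_probs)
print(v_s)  # 0.7*20 + 0.3*(-10) = 11.0
```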
Learn After
An agent is in a state 'S' and must choose between two policies, Policy A and Policy B. The sequence of rewards the agent will receive after starting in state 'S' and following each policy is deterministic and known:
- Policy A Reward Sequence: [+10, +1, +1, +1, ...]
- Policy B Reward Sequence: [+3, +3, +3, +3, ...]
Given the formula for the value of a state, $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s\right]$, which of the following statements correctly analyzes the relationship between the discount factor $\gamma$ and the value of state 'S' for each policy? (See the worked sketch after this list.)
Calculating State Value in a Deterministic Environment
Advantage Function Formula
Temporal Difference (TD) Error as an Advantage Function Estimator
An agent is in a state 'S' and follows a fixed policy. From this state, the environment is stochastic: there is a 50% chance the agent will enter a trajectory with a reward sequence of [+10, 0, 0, ...] and a 50% chance it will enter a different trajectory with a reward sequence of [0, +10, 0, ...]. Given the state-value formula and a discount factor (γ) of 0.9, what is the value of state 'S'?
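Both of the questions above can be worked by plugging their reward sequences into the state-value formula from this card. The sketch below is one illustrative way to do that; the helper name and the closed-form geometric-series handling of the constant reward tail are assumptions. Intuitively, a small $\gamma$ favors Policy A's large immediate reward, while $\gamma$ near 1 favors Policy B's steady stream; in the stochastic case, the state value is the probability-weighted average of the two trajectories' discounted returns.

```python
def discounted_value(first_rewards, tail_reward, gamma):
    """Exact V = sum_t gamma^t * R_t for a reward stream that begins with
    `first_rewards` and then repeats `tail_reward` forever (the constant
    tail is summed in closed form as a geometric series)."""
    v = sum(gamma**t * r for t, r in enumerate(first_rewards))
    tail_start = len(first_rewards)
    v += gamma**tail_start * tail_reward / (1.0 - gamma)
    return v

gamma = 0.9

# Policy comparison: V_A = 10 + gamma/(1-gamma), V_B = 3/(1-gamma)
v_a = discounted_value([10.0], 1.0, gamma)   # [+10, +1, +1, ...]
v_b = discounted_value([], 3.0, gamma)       # [+3, +3, +3, ...]
print(f"V_A = {v_a:.2f}, V_B = {v_b:.2f}")   # V_A = 19.00, V_B = 30.00

# Stochastic question: a 50/50 mix of two deterministic trajectories
v_traj1 = discounted_value([10.0, 0.0], 0.0, gamma)  # [+10, 0, 0, ...]
v_traj2 = discounted_value([0.0, 10.0], 0.0, gamma)  # [0, +10, 0, ...]
v_s = 0.5 * v_traj1 + 0.5 * v_traj2
print(f"V(S) = {v_s}")  # 0.5*10 + 0.5*(0.9*10) = 9.5
```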