Objective Function as Expected Cumulative Reward (Performance Function)
In reinforcement learning, the objective function J(θ), also known as the performance function, evaluates the effectiveness of a policy π_θ parameterized by θ. It is defined as the expected cumulative reward over all possible trajectories τ. The formula is commonly expressed as J(θ) = 𝔼_{τ∼π_θ}[R(τ)]. The notation τ ∼ π_θ signifies that the trajectory τ is generated by following the policy π_θ. Alternatively, this objective can be written as a sum over the space of all trajectories, weighted by the probability of each trajectory under the policy: J(θ) = Σ_τ P(τ|π_θ) R(τ). Here, R(τ) is the cumulative reward for a trajectory.
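The summation form of the objective can be sketched directly for a toy problem with an enumerable trajectory space. The probabilities and rewards below are invented for illustration and are not from the card:

```python
# Toy illustration of J(theta) = sum over trajectories of P(tau | pi_theta) * R(tau).
# Each entry is (probability of the trajectory under the policy, cumulative reward R(tau)).

def performance(trajectories):
    """Expected cumulative reward: sum_tau P(tau) * R(tau)."""
    return sum(p * r for p, r in trajectories)

# Three hypothetical trajectories; probabilities sum to 1.
toy = [(0.5, 4.0), (0.3, -2.0), (0.2, 10.0)]
print(performance(toy))  # 0.5*4 + 0.3*(-2) + 0.2*10 = 3.4
```

For continuous or very large trajectory spaces this sum becomes an expectation estimated by sampling trajectories from π_θ, which is exactly what the 𝔼_{τ∼π_θ} notation expresses.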

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy in the Context of LLMs
Objective Function as Expected Cumulative Reward (Performance Function)
An agent operates in an environment where sequences of events unfold over time. The agent's behavior is described by a policy, denoted π(a|s), which gives the probability of taking action a when in state s. The environment's dynamics are described by a transition function, P(s′|s, a), which gives the probability of moving to the next state s′ after taking action a in state s. The process begins from an initial state, s₀, drawn with probability P(s₀).
Consider the following specific two-step sequence of events (a trajectory):
- The process starts in state s₀.
- The agent takes action a₀.
- The environment transitions to state s₁.
- The agent takes action a₁.
- The environment transitions to state s₂.
Which expression correctly represents the probability of this entire specific trajectory occurring?
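The reasoning this question tests is the chain-rule factorization of a trajectory's probability into the initial-state probability, then alternating policy and transition factors. A minimal sketch, with all numeric probabilities invented for illustration:

```python
# P(tau) = P(s0) * pi(a0|s0) * P(s1|s0,a0) * pi(a1|s1) * P(s2|s1,a1)
# The values below are hypothetical; only the factorization matters.

def trajectory_probability(p_s0, steps):
    """steps: list of (pi(a_t|s_t), P(s_{t+1}|s_t, a_t)) pairs, one per time step."""
    p = p_s0
    for pi_a, p_next in steps:
        p *= pi_a * p_next
    return p

p = trajectory_probability(1.0, [(0.6, 0.9), (0.5, 0.8)])
print(p)  # 1.0 * (0.6 * 0.9) * (0.5 * 0.8) = 0.216
```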
Diagnosing a Faulty Sequence Generation Process
When modeling the generation of a sequence of states and actions as a Markov Decision Process, the probability of transitioning to a new state at any given step depends on the complete history of all states and actions that have occurred since the beginning of the sequence.
Notational Variations in State-Action Sequences (Trajectories)
An agent is generating a sequence by interacting with an environment. For a single time step, starting from state
s_t, arrange the following events in the correct logical order.
Objective Function as Expected Cumulative Reward (Performance Function)
An agent is being trained to find the best route through a system. It is presented with two options:
- Route 1: Provides a consistent, small positive reward at every step, resulting in a total reward of +15 for the entire route.
- Route 2: Starts with a step that gives a negative reward (a penalty) of -5, but subsequent steps lead to very high rewards, resulting in a total reward of +50 for the entire route.
An agent that has been successfully trained according to the primary objective of its learning framework will learn to choose Route 2. Which of the following statements best explains why?
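The comparison the card describes reduces to summing each route's rewards, since the objective is cumulative reward rather than any single step's reward. A small sketch, where the per-step reward lists are invented to match the stated totals of +15 and +50:

```python
# Hypothetical per-step rewards consistent with the card's totals.
route_1 = [1] * 15          # steady small rewards, total +15
route_2 = [-5] + [11] * 5   # early penalty, then large rewards, total +50

def cumulative_reward(rewards):
    """Total (undiscounted) reward over the route."""
    return sum(rewards)

best = max([route_1, route_2], key=cumulative_reward)
print(cumulative_reward(route_1), cumulative_reward(route_2))  # 15 50
```

An agent maximizing cumulative reward prefers route_2 despite its initial penalty, because the objective scores whole trajectories, not individual steps.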
Analysis of a Suboptimal Agent Policy
An agent is learning to play a game where the objective is to get the highest possible final score. At a critical decision point, the agent chooses an action that yields an immediate reward of 0, passing up an alternative action that would have given an immediate reward of +10. This decision is necessarily an indication that the agent's policy is flawed and not aligned with the primary goal of its learning framework.
Learn After
Training Objective as Maximization of the Performance Function
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by θ. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories (J(θ)), what is the value of the performance function for this policy?
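The expected-value arithmetic behind this card can be checked in one line, using the probabilities and rewards stated above:

```python
# J(theta) = P(A) * R(A) + P(B) * R(B)
j = 0.7 * 20 + 0.3 * (-10)
print(j)  # 14 - 3 = 11.0
```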
Critique of the Expected Reward Objective
On-Policy Objective Function (Performance Measure)
Policy Performance Comparison