Policy Performance Comparison
An agent is at a starting point and must choose between two paths. Path 1 results in a trajectory with a total reward of +10. Path 2 results in a trajectory with a total reward of +2. You are tasked with evaluating two different policies for the agent. Based on the objective function, which is defined as the expected cumulative reward over all possible trajectories ($J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$), which policy performs better? Justify your answer by calculating the performance measure for each policy.
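The calculation the question asks for can be sketched in a few lines of Python. The path rewards (+10 and +2) come from the problem statement; the two policies themselves are not specified in the text, so the path probabilities below (`policy_a`, `policy_b`) and the helper `expected_return` are hypothetical, chosen only to illustrate the method.

```python
def expected_return(path_probs, path_rewards):
    """Expected cumulative reward: sum over trajectories of P(tau) * R(tau)."""
    if abs(sum(path_probs.values()) - 1.0) > 1e-9:
        raise ValueError("trajectory probabilities must sum to 1")
    return sum(p * path_rewards[path] for path, p in path_probs.items())


# Total rewards of the two trajectories, taken from the problem statement.
rewards = {"path_1": 10.0, "path_2": 2.0}

# Hypothetical policies (NOT given in the text): probability of choosing each path.
policy_a = {"path_1": 0.8, "path_2": 0.2}
policy_b = {"path_1": 0.3, "path_2": 0.7}

j_a = expected_return(policy_a, rewards)  # 0.8 * 10 + 0.2 * 2
j_b = expected_return(policy_b, rewards)  # 0.3 * 10 + 0.7 * 2

better = "policy_a" if j_a > j_b else "policy_b"
print(f"J(policy_a) = {j_a:.2f}, J(policy_b) = {j_b:.2f}, better: {better}")
```

Under these assumed probabilities, the policy that puts more weight on Path 1 scores higher, because Path 1 carries the larger reward. The same comparison applies to whatever probabilities the actual policies assign.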
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Training Objective as Maximization of the Performance Function
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by $\theta$. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories ($J(\theta)$), what is the value of the performance function for this policy?
Critique of the Expected Reward Objective
On-Policy Objective Function (Performance Measure)