On-Policy Objective Function (Performance Measure)
The performance of a policy in reinforcement learning is measured by an objective function, J(θ). This function is defined as the expected cumulative reward, R(τ), over the distribution of trajectories τ generated by following the policy π_θ. The goal of the agent is to find the policy parameters θ that maximize this value. The formula is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$
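Written out for a discrete trajectory space, the expectation is a probability-weighted sum over trajectories. The per-step form below assumes a finite horizon T and per-step rewards r_t; that notation is introduced here for illustration and is not from the original card:

$$J(\theta) = \sum_{\tau} P(\tau \mid \theta)\, R(\tau), \qquad R(\tau) = \sum_{t=0}^{T-1} r_t$$

The practice questions in the Related and Learn After lists below all reduce to evaluating or estimating this weighted sum.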
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Objective as Maximization of the Performance Function
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by θ. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories (J(θ)), what is the value of the performance function for this policy?
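For reference, plugging the question's numbers into the expected-reward formula above gives:

$$J(\theta) = 0.7 \times 20 + 0.3 \times (-10) = 14 - 3 = 11$$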
Critique of the Expected Reward Objective
Policy Performance Comparison
Learn After
Equivalence of the Surrogate Objective and the On-Policy Objective
A reinforcement learning agent has developed a new policy, denoted as π_new, for navigating a maze. The goal is to accurately estimate the performance of this specific policy using its on-policy objective function, which is defined as the expected cumulative reward over trajectories generated by the policy itself. Which of the following procedures correctly describes how to gather data and compute this estimate?
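The correct on-policy procedure is to generate fresh trajectories by running π_new itself and average their total rewards. A minimal Monte Carlo sketch of that procedure, assuming a Gymnasium-style environment API (env.reset(), env.step()) and a hypothetical pi_new(state) function that samples an action; both names are assumptions, not part of the original card:

import numpy as np

def estimate_on_policy_return(env, pi_new, num_trajectories=1000, max_steps=500):
    """Monte Carlo estimate of J(θ) for pi_new: average the total reward
    of trajectories generated by running pi_new itself (on-policy data)."""
    returns = []
    for _ in range(num_trajectories):
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = pi_new(state)  # sample an action from the policy under evaluation
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        returns.append(total_reward)
    # The sample mean over trajectories drawn from pi_new approximates J(θ_new).
    return float(np.mean(returns))

Because every trajectory is sampled from π_new's own distribution, the sample mean converges to J(θ_new) as num_trajectories grows; no reweighting is needed.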
Evaluating a New Robotic Arm Policy
A research team is training an agent and has a policy represented by parameters θ_current. To evaluate the performance of this policy using its on-policy objective function, J(θ_current), the team can use a large, pre-existing dataset of trajectories that were collected while the agent was operating under a slightly older set of parameters, θ_previous.
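As stated, averaging returns from that dataset estimates J(θ_previous), not J(θ_current), because the trajectories were drawn from the older policy's distribution. Reusing them requires the importance-sampling correction referenced in "Off-Policy Objective Function with Importance Sampling" above; a sketch of the reweighted estimator, with the ratio notation assumed here:

$$J(\theta_{\text{current}}) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{previous}}}}\left[\frac{P(\tau \mid \theta_{\text{current}})}{P(\tau \mid \theta_{\text{previous}})}\, R(\tau)\right]$$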