A research team is training an agent and has a policy represented by parameters θ_current. To evaluate the performance of this policy using its on-policy objective function, J(θ_current), the team can use a large, pre-existing dataset of trajectories that were collected while the agent was operating under a slightly older set of parameters, θ_previous.
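The distinction the stem probes — whether trajectories gathered under θ_previous can estimate the on-policy objective J(θ_current) — can be made concrete with a toy sketch. A naive average of rewards from the old dataset estimates J(θ_previous), not J(θ_current); reweighting each sample by the likelihood ratio π_current(a)/π_previous(a) (importance sampling) corrects for the mismatch. The one-step environment, reward table, and policy names below are all illustrative assumptions, not part of the original question:

```python
import random

random.seed(0)

# Toy one-step environment: two actions with fixed rewards (assumed for illustration).
REWARDS = {0: 1.0, 1: 3.0}

def sample_action(pi):
    """Sample an action from a policy given as {action: probability}."""
    return random.choices(list(pi), weights=list(pi.values()))[0]

pi_prev = {0: 0.7, 1: 0.3}   # behaviour policy (theta_previous)
pi_curr = {0: 0.2, 1: 0.8}   # target policy (theta_current)

# True on-policy value of the current policy: J = sum_a pi(a) * r(a)
j_true = sum(p * REWARDS[a] for a, p in pi_curr.items())

# Data was collected under pi_prev. Averaging it directly is biased toward
# J(theta_previous); the importance weight pi_curr(a)/pi_prev(a) fixes this.
n = 100_000
naive = 0.0
weighted = 0.0
for _ in range(n):
    a = sample_action(pi_prev)
    r = REWARDS[a]
    naive += r / n                                  # estimates J(theta_previous)
    weighted += (pi_curr[a] / pi_prev[a]) * r / n   # importance-sampled J(theta_current)

print(f"true J(theta_current)       = {j_true:.3f}")
print(f"naive average of old data   = {naive:.3f}")
print(f"importance-weighted average = {weighted:.3f}")
```

With these assumed policies the naive average concentrates near 1.6 (the old policy's value) while the importance-weighted estimate concentrates near the true 2.6, which is why reusing an off-policy dataset for J(θ_current) requires the correction.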
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
A reinforcement learning agent has developed a new policy, denoted as π_new, for navigating a maze. The goal is to accurately estimate the performance of this specific policy using its on-policy objective function, which is defined as the expected cumulative reward over trajectories generated by the policy itself. Which of the following procedures correctly describes how to gather data and compute this estimate?
Evaluating a New Robotic Arm Policy