Evaluating a New Robotic Arm Policy
A robotics team wants to evaluate their new policy, denoted as π_new, for a robotic arm. To do this, they need to estimate the on-policy performance measure, J(π_new). They have access to two datasets of the arm's interactions with its environment. Based on the case study details below, which dataset should they use, and why is the other dataset unsuitable for calculating this specific performance measure?
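The on-policy performance measure J(π_new) is the expected (discounted) return over trajectories generated by π_new itself, so only a dataset collected by running π_new is directly usable; averaging returns from trajectories produced by any other policy estimates that policy's objective instead. A minimal Monte Carlo sketch, assuming hypothetical `policy`, `env_step`, and `env_reset` callables (none of these come from the case study):

```python
def estimate_on_policy_return(policy, env_step, env_reset,
                              num_episodes=1000, gamma=0.99, max_steps=200):
    """Monte Carlo estimate of J(pi): average discounted return over
    trajectories that the evaluated policy generates itself."""
    total = 0.0
    for _ in range(num_episodes):
        state = env_reset()
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)  # actions must come from pi_new itself
            state, reward, done = env_step(state, action)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / num_episodes
```

With a stochastic policy or environment, the average over episodes converges to J(π_new) as `num_episodes` grows; with everything deterministic, a single episode already gives the exact value.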
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
A reinforcement learning agent has developed a new policy, denoted as π_new, for navigating a maze. The goal is to accurately estimate the performance of this specific policy using its on-policy objective function, which is defined as the expected cumulative reward over trajectories generated by the policy itself. Which of the following procedures correctly describes how to gather data and compute this estimate?
Evaluating a New Robotic Arm Policy
A research team is training an agent and has a policy represented by parameters θ_current. To evaluate the performance of this policy using its on-policy objective function, J(θ_current), the team can use a large, pre-existing dataset of trajectories that were collected while the agent was operating under a slightly older set of parameters, θ_previous.
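Directly averaging returns from the θ_previous dataset would estimate J(θ_previous), not J(θ_current). The standard remedy is importance sampling: reweight each old trajectory's return by the product of per-step likelihood ratios π_new(a_t|s_t)/π_old(a_t|s_t). A minimal sketch, assuming hypothetical `logp_new` and `logp_old` log-probability callables and trajectories stored as (state, action, reward) tuples:

```python
import math

def importance_sampled_return(trajectories, logp_new, logp_old, gamma=0.99):
    """Off-policy estimate of J(pi_new) from trajectories collected under
    pi_old, weighting each trajectory's return by
    w = prod_t pi_new(a_t|s_t) / pi_old(a_t|s_t)."""
    total = 0.0
    for traj in trajectories:  # traj: list of (state, action, reward)
        log_w, ret, discount = 0.0, 0.0, 1.0
        for (s, a, r) in traj:
            log_w += logp_new(s, a) - logp_old(s, a)  # accumulate ratio in log space
            ret += discount * r
            discount *= gamma
        total += math.exp(log_w) * ret  # reweight the old trajectory's return
    return total / len(trajectories)
```

The estimator is unbiased when π_old has support wherever π_new does, but its variance grows quickly as the two policies diverge, which is why such corrections are only trusted for "slightly older" parameters like θ_previous.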