Multiple Choice

A reinforcement learning agent has developed a new policy, denoted as π_new, for navigating a maze. The goal is to accurately estimate the performance of this specific policy using its on-policy objective function, which is defined as the expected cumulative reward over trajectories generated by the policy itself. Which of the following procedures correctly describes how to gather data and compute this estimate?

0

1

Updated 2025-10-07

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science