Off-Policy Performance Estimation
You are evaluating a new 'target' policy using a batch of 100 trajectories collected with an older 'reference' policy. This batch contains two distinct types of trajectories, A and B. Based on the data below, calculate the estimated performance of the target policy. The performance estimate is the average of the importance-weighted rewards over the entire batch.
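The question's actual numbers for trajectory types A and B are not reproduced here, so the following is only an illustrative sketch of the estimator under assumed counts, importance weights, and rewards. The importance weight for a trajectory is the ratio of its probability under the target policy to its probability under the reference policy, and the performance estimate is the average of weight × reward over the whole batch:

```python
# Hypothetical batch of 100 trajectories; all numbers below are assumptions,
# not the question's data. Each entry: (count, importance_weight, reward),
# where importance_weight = p_target(trajectory) / p_ref(trajectory).
batch = [
    (60, 1.5, 10.0),  # type A: 60 trajectories
    (40, 0.5, 2.0),   # type B: 40 trajectories
]

# Sum of importance-weighted rewards over every trajectory in the batch.
total = sum(count * w * r for count, w, r in batch)
n = sum(count for count, _, _ in batch)  # batch size (100 here)

# Estimated performance of the target policy.
estimate = total / n
print(estimate)  # → 9.4 for these assumed numbers
```

The same structure applies to the real data: group identical trajectories, multiply each group's reward by its importance weight and its count, and divide by the total number of trajectories.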
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Surrogate Objective in Reinforcement Learning
Equivalence of the Surrogate Objective and the On-Policy Objective
An agent's performance is being evaluated using a set of recorded experiences (trajectories) generated by an older, reference policy. The new, target policy being evaluated assigns a specific high-reward trajectory a significantly lower probability than the reference policy did. How will the contribution of this high-reward trajectory be adjusted when estimating the performance of the target policy?
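A minimal numeric sketch of the effect the question asks about, using assumed probabilities and an assumed reward: when the target policy makes the trajectory less probable, the importance weight falls below 1 and the trajectory's reward contribution is scaled down by that ratio.

```python
# Assumed values for illustration only.
p_ref = 0.4      # probability the reference policy assigns to the trajectory
p_target = 0.1   # target policy makes the same trajectory much less probable
reward = 40.0    # high reward recorded on this trajectory

# Importance weight: ratio of target to reference probability (< 1 here).
weight = p_target / p_ref          # 0.25

# The trajectory's contribution to the estimate is its reward scaled by the weight.
contribution = weight * reward     # 10.0 — down-weighted from 40.0
```

So the high-reward trajectory still appears in the estimate, but its reward is multiplied by a weight less than 1, reducing its contribution in proportion to how much less probable the target policy makes it.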
Off-Policy Performance Estimation
Consider an off-policy evaluation scenario where the performance of a 'target' policy is estimated using data collected from a 'reference' policy. If the target policy is identical to the reference policy, the importance sampling weight used to adjust the reward of every possible trajectory will be exactly 1.
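The statement above can be checked directly: if the target policy is the reference policy, every trajectory's importance weight is a probability divided by itself. A short sketch with assumed trajectory probabilities:

```python
def importance_weight(p_target: float, p_ref: float) -> float:
    """Importance sampling weight for one trajectory."""
    return p_target / p_ref

# Assumed trajectory probabilities under the shared (identical) policy.
trajectory_probs = [0.3, 0.05, 0.65]

# Target and reference policies coincide, so each weight is p / p = 1.
weights = [importance_weight(p, p) for p in trajectory_probs]
```

With every weight equal to 1, the importance-weighted average reduces to the ordinary on-policy average reward, which is why the statement is true.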