Off-Policy Performance Estimation
You are evaluating a new 'target' policy using a batch of 100 trajectories collected with an older 'reference' policy. This batch contains two distinct types of trajectories, A and B. Based on the data below, calculate the estimated performance of the target policy. The performance estimate is the average of the importance-weighted rewards over the entire batch.
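The question's actual numbers for trajectory types A and B are not reproduced here, so the following is only an illustrative sketch of the estimator under assumed counts, importance weights, and rewards. The importance weight for a trajectory is the ratio of its probability under the target policy to its probability under the reference policy, and the performance estimate is the average of weight × reward over the whole batch:

```python
# Hypothetical batch of 100 trajectories; all numbers below are assumptions,
# not the question's data. Each entry: (count, importance_weight, reward),
# where importance_weight = p_target(trajectory) / p_ref(trajectory).
batch = [
    (60, 1.5, 10.0),  # type A: 60 trajectories
    (40, 0.5, 2.0),   # type B: 40 trajectories
]

# Sum of importance-weighted rewards over every trajectory in the batch.
total = sum(count * w * r for count, w, r in batch)
n = sum(count for count, _, _ in batch)  # batch size (100 here)

# Estimated performance of the target policy.
estimate = total / n
print(estimate)  # → 9.4 for these assumed numbers
```

The same structure applies to the real data: group identical trajectories, multiply each group's reward by its importance weight and its count, and divide by the total number of trajectories.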
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Surrogate Objective in Reinforcement Learning
Equivalence of the Surrogate Objective and the On-Policy Objective
An agent's performance is being evaluated using a set of recorded experiences (trajectories) generated by an older, reference policy. The new, target policy being evaluated assigns a specific high-reward trajectory a significantly lower probability than the reference policy did. How will the contribution of this high-reward trajectory be adjusted when estimating the performance of the target policy?
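A minimal numeric sketch of the effect the question asks about, using assumed probabilities and an assumed reward: when the target policy makes the trajectory less probable, the importance weight falls below 1 and the trajectory's reward contribution is scaled down by that ratio.

```python
# Assumed values for illustration only.
p_ref = 0.4      # probability the reference policy assigns to the trajectory
p_target = 0.1   # target policy makes the same trajectory much less probable
reward = 40.0    # high reward recorded on this trajectory

# Importance weight: ratio of target to reference probability (< 1 here).
weight = p_target / p_ref          # 0.25

# The trajectory's contribution to the estimate is its reward scaled by the weight.
contribution = weight * reward     # 10.0 — down-weighted from 40.0
```

So the high-reward trajectory still appears in the estimate, but its reward is multiplied by a weight less than 1, reducing its contribution in proportion to how much less probable the target policy makes it.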
Off-Policy Performance Estimation
Consider an off-policy evaluation scenario where the performance of a 'target' policy is estimated using data collected from a 'reference' policy. If the target policy is identical to the reference policy, the importance sampling weight used to adjust the reward of every possible trajectory will be exactly 1.
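The statement above can be checked directly: if the target policy is the reference policy, every trajectory's importance weight is a probability divided by itself. A short sketch with assumed trajectory probabilities:

```python
def importance_weight(p_target: float, p_ref: float) -> float:
    """Importance sampling weight for one trajectory."""
    return p_target / p_ref

# Assumed trajectory probabilities under the shared (identical) policy.
trajectory_probs = [0.3, 0.05, 0.65]

# Target and reference policies coincide, so each weight is p / p = 1.
weights = [importance_weight(p, p) for p in trajectory_probs]
```

With every weight equal to 1, the importance-weighted average reduces to the ordinary on-policy average reward, which is why the statement is true.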