Off-Policy Objective Function with Importance Sampling
In off-policy reinforcement learning, the performance of a target policy $\pi_\theta$ can be evaluated using data generated from a different, reference policy $\hat{\pi}$. This is achieved by reformulating the objective function with importance sampling: the objective becomes the expected cumulative reward under the reference policy, where each reward is weighted by the ratio of the trajectory probabilities between the target and reference policies. The formula is:

$$J(\theta) = \mathbb{E}_{\tau \sim \hat{\pi}}\left[ \frac{P_{\pi_\theta}(\tau)}{P_{\hat{\pi}}(\tau)} \, R(\tau) \right]$$

The term $\frac{P_{\pi_\theta}(\tau)}{P_{\hat{\pi}}(\tau)}$ is the importance sampling weight, which adjusts for the discrepancy between the policy that collected the data and the policy being evaluated.
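As an illustration, here is a minimal Monte Carlo sketch of this estimator over a toy discrete trajectory distribution. Everything in it (the three trajectories, the arrays `returns`, `p_ref`, and `p_target`, and the sample size) is an assumed example, not something specified in the text; it only shows that reweighting reference-policy samples by $P_{\pi_\theta}(\tau)/P_{\hat{\pi}}(\tau)$ recovers the target policy's expected return.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: three possible trajectories with fixed returns.
# p_ref and p_target are the trajectory probabilities under the reference
# policy (which generated the data) and the target policy being evaluated.
returns  = np.array([20.0, -10.0, 5.0])   # R(tau) for each trajectory
p_ref    = np.array([0.5, 0.3, 0.2])      # P under reference policy
p_target = np.array([0.6, 0.1, 0.3])      # P under target policy

# Sample trajectories from the *reference* policy only.
samples = rng.choice(len(returns), size=100_000, p=p_ref)

# Importance sampling weight for each sampled trajectory:
# w(tau) = P_target(tau) / P_ref(tau).
weights = p_target[samples] / p_ref[samples]

# Monte Carlo estimate of J(theta) = E_{tau ~ ref}[ w(tau) * R(tau) ].
j_estimate = np.mean(weights * returns[samples])

# Exact value for comparison: sum over trajectories of P_target * R.
j_exact = np.sum(p_target * returns)

print(f"importance-sampled estimate: {j_estimate:.3f}")
print(f"exact target-policy value:   {j_exact:.3f}")
```

Averaging the weighted returns converges to the exact target-policy value (12.5 in this toy setup), even though no trajectory was ever sampled from the target policy.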

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Objective as Maximization of the Performance Function
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by $\theta$. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories ($J(\theta)$), what is the value of the performance function for this policy? (A worked calculation follows this list.)
Critique of the Expected Reward Objective
On-Policy Objective Function (Performance Measure)
Policy Performance Comparison
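The two-trajectory question above can be answered directly from the definition it gives: the performance function is the probability-weighted sum of trajectory returns, so

$$J(\theta) = \sum_{\tau} P_{\pi_\theta}(\tau)\,R(\tau) = 0.7 \times 20 + 0.3 \times (-10) = 14 - 3 = 11$$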
Learn After
Surrogate Objective in Reinforcement Learning
Equivalence of the Surrogate Objective and the On-Policy Objective
An agent's performance is being evaluated using a set of recorded experiences (trajectories) that were generated by an older, reference policy. The new, target policy being evaluated makes a specific high-reward trajectory significantly less probable than the reference policy did. How will the contribution of this specific high-reward trajectory be adjusted when estimating the performance of the new target policy?
Off-Policy Performance Estimation
Consider an off-policy evaluation scenario where the performance of a 'target' policy is estimated using data collected from a 'reference' policy. If the target policy is identical to the reference policy, the importance sampling weight used to adjust the reward of every possible trajectory will be exactly 1.
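Both of the questions in the list above follow from the importance sampling weight alone; the sketch below uses the same trajectory-level notation as the main formula. If the target policy makes a high-reward trajectory $\tau^*$ (with $R(\tau^*) > 0$) less probable than the reference policy did, the weight falls below one and that trajectory's contribution to the estimate is scaled down:

$$w(\tau^*) = \frac{P_{\pi_\theta}(\tau^*)}{P_{\hat{\pi}}(\tau^*)} < 1 \quad\Longrightarrow\quad w(\tau^*)\,R(\tau^*) < R(\tau^*)$$

And if the target policy is identical to the reference policy, the weight is exactly one for every trajectory, so the off-policy estimate reduces to the on-policy objective:

$$\pi_\theta = \hat{\pi} \quad\Longrightarrow\quad w(\tau) = \frac{P_{\hat{\pi}}(\tau)}{P_{\hat{\pi}}(\tau)} = 1 \ \text{ for all } \tau$$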