Concept

Separation of Sampling and Reward Computation in Policy Learning

The formulation of the surrogate objective allows for a functional separation between the sequence sampling process and the reward computation. By using a baseline policy (parameterized by $\theta_{\mathrm{ref}}$) to sample a batch of sequences, and then applying the target policy (parameterized by $\theta$) to compute the expected reward, the overall procedure avoids the need to sample directly from the policy being evaluated. This decoupling is especially beneficial in reinforcement learning scenarios where generating trajectories from the target policy is computationally expensive or difficult.
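
One common way to write such a surrogate objective is as an importance-sampled reward estimate (a standard formulation, sketched here under assumed notation that the section itself does not fix: $\mathcal{D}$ is the prompt distribution, $r(x, y)$ a reward function, and $\pi_{\theta}$, $\pi_{\theta_{\mathrm{ref}}}$ the target and baseline policies):

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\mathrm{ref}}}(\cdot \mid x)}\left[\frac{\pi_{\theta}(y \mid x)}{\pi_{\theta_{\mathrm{ref}}}(y \mid x)}\, r(x, y)\right]
$$

Because the expectation is taken over samples from $\pi_{\theta_{\mathrm{ref}}}$, the target policy $\pi_{\theta}$ enters only through the likelihood ratio, so no fresh samples from $\pi_{\theta}$ are required.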
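
As a concrete illustration, the following is a minimal, self-contained sketch of the same decoupling (toy categorical distributions stand in for sequence policies, and the probabilities and rewards are invented for the example):

```python
import random

random.seed(0)

ACTIONS = ["a", "b", "c"]
p_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # baseline policy pi_{theta_ref}
p_tgt = {"a": 0.2, "b": 0.5, "c": 0.3}   # target policy pi_theta
reward = {"a": 1.0, "b": 2.0, "c": 0.5}  # toy reward r(y)

def sample_ref() -> str:
    """Draw one sample from the baseline policy (inverse-CDF sampling)."""
    u, acc = random.random(), 0.0
    for y, p in p_ref.items():
        acc += p
        if u < acc:
            return y
    return ACTIONS[-1]

# Step 1: sampling touches ONLY the baseline policy.
batch = [sample_ref() for _ in range(100_000)]

# Step 2: the target policy enters only through the likelihood ratio
# pi_theta(y) / pi_ref(y); no samples are drawn from the target policy.
estimate = sum(p_tgt[y] / p_ref[y] * reward[y] for y in batch) / len(batch)

# Exact expected reward under the target policy, for comparison.
exact = sum(p_tgt[y] * reward[y] for y in ACTIONS)
print(f"importance-sampled estimate: {estimate:.3f}   exact: {exact:.3f}")
```

This is the practical payoff of the decoupling: a batch sampled once from $\theta_{\mathrm{ref}}$ can be re-scored as $\theta$ changes during optimization, instead of regenerating trajectories at every update.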
