Concept

Separation of Sampling and Reward Computation in Policy Learning

The formulation of the surrogate objective allows for a functional separation between the sequence sampling process and the reward computation. By using a baseline policy (parameterized by $\theta_{\mathrm{ref}}$) to sample a batch of sequences, and then applying the target policy (parameterized by $\theta$) to compute the expected reward, the overall procedure avoids the need to sample directly from the policy being evaluated. This decoupling is especially beneficial in reinforcement learning scenarios where generating trajectories from the target policy is computationally expensive or difficult.
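
One common way to write such a surrogate objective is as an importance-sampled reward estimate (a standard formulation, sketched here under assumed notation that the section itself does not fix: $\mathcal{D}$ is the prompt distribution, $r(x, y)$ a reward function, and $\pi_{\theta}$, $\pi_{\theta_{\mathrm{ref}}}$ the target and baseline policies):

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\mathrm{ref}}}(\cdot \mid x)}\left[\frac{\pi_{\theta}(y \mid x)}{\pi_{\theta_{\mathrm{ref}}}(y \mid x)}\, r(x, y)\right]
$$

Because the expectation is taken over samples from $\pi_{\theta_{\mathrm{ref}}}$, the target policy $\pi_{\theta}$ enters only through the likelihood ratio, so no fresh samples from $\pi_{\theta}$ are required.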
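
As a concrete illustration, the following is a minimal, self-contained sketch of the same decoupling (toy categorical distributions stand in for sequence policies, and the probabilities and rewards are invented for the example):

```python
import random

random.seed(0)

ACTIONS = ["a", "b", "c"]
p_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # baseline policy pi_{theta_ref}
p_tgt = {"a": 0.2, "b": 0.5, "c": 0.3}   # target policy pi_theta
reward = {"a": 1.0, "b": 2.0, "c": 0.5}  # toy reward r(y)

def sample_ref() -> str:
    """Draw one sample from the baseline policy (inverse-CDF sampling)."""
    u, acc = random.random(), 0.0
    for y, p in p_ref.items():
        acc += p
        if u < acc:
            return y
    return ACTIONS[-1]

# Step 1: sampling touches ONLY the baseline policy.
batch = [sample_ref() for _ in range(100_000)]

# Step 2: the target policy enters only through the likelihood ratio
# pi_theta(y) / pi_ref(y); no samples are drawn from the target policy.
estimate = sum(p_tgt[y] / p_ref[y] * reward[y] for y in batch) / len(batch)

# Exact expected reward under the target policy, for comparison.
exact = sum(p_tgt[y] * reward[y] for y in ACTIONS)
print(f"importance-sampled estimate: {estimate:.3f}   exact: {exact:.3f}")
```

This is the practical payoff of the decoupling: a batch sampled once from $\theta_{\mathrm{ref}}$ can be re-scored as $\theta$ changes during optimization, instead of regenerating trajectories at every update.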
