Definition

Reference Policy (πθref\pi_{\theta_{\text{ref}}})

A reference policy, denoted πθref\pi_{\theta_{\text{ref}}}, is a fixed policy used as a baseline in certain reinforcement learning algorithms. It is typically an earlier version of the policy being trained or a pre-trained model. The training process aims to improve the current policy (πθ\pi_\theta) without deviating too far from the reference policy, which helps stabilize learning.

Image 0

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences