1Cademy - Reference Policy and Model Probability

Learn Before

Reference Policy Definition in RLHF

Short Answer

Reference Policy and Model Probability

In a system that learns from human feedback, a 'reference model' with a fixed set of parameters, $\theta_{\text{ref}}$, is used to generate a probability distribution, $\text{Pr}_{\theta_{\text{ref}}}(\cdot)$. Explain the precise relationship between this probability distribution and the system's 'reference policy', denoted as $\pi_{\theta_{\text{ref}}}(\cdot)$.
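To make the notation concrete, here is a minimal Python sketch. It assumes the common RLHF convention, which the prompt itself does not state, that the reference policy is simply the frozen model's own output distribution. The toy bigram table `THETA_REF`, the vocabulary, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical frozen parameters theta_ref: one row of next-token logits
# per previous token (a toy bigram model, not a real language model).
VOCAB = ["<s>", "a", "b", "</s>"]
THETA_REF = {
    "<s>": [-10.0, 1.0, 0.5, -1.0],
    "a": [-10.0, 0.2, 1.5, 0.3],
    "b": [-10.0, 0.7, 0.1, 1.2],
}


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(prev_token):
    """Pr_theta_ref(. | prev_token): distribution over the next token."""
    return dict(zip(VOCAB, softmax(THETA_REF[prev_token])))


def reference_policy(sequence):
    """pi_theta_ref(y): probability of the output y under the frozen model.

    Assumption: the policy is the model distribution itself, so the value
    is the product of the per-token conditional probabilities. The sequence
    is expected to end at "</s>" (no tokens after it).
    """
    prob = 1.0
    prev = "<s>"
    for token in sequence:
        prob *= next_token_probs(prev)[token]
        prev = token
    return prob


if __name__ == "__main__":
    y = ["a", "b", "</s>"]
    print(reference_policy(y))
```

Because `THETA_REF` is never updated, calling `reference_policy` on the same output always returns the same value, which is the property that lets the reference model serve as a fixed baseline during training.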

Updated 2025-10-08
