1Cademy - Justifying the Reference Policy

Learn Before

Optimal Policy as a Product of Reference Policy and Exponentiated Reward

Short Answer

Justifying the Reference Policy

A language model's policy is being shaped using a reward function, $r(\mathbf{x}, \mathbf{y})$. Rather than being defined as directly proportional to the exponentiated reward alone, the policy is defined as proportional to the product of a reference policy and the exponentiated reward: $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y}))$. Explain the primary benefit of including the $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$ term in this formulation.
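To make the formulation concrete, the sketch below evaluates it numerically for a single prompt with a small, finite set of candidate responses. The function name `tilted_policy` and all numeric values are hypothetical and chosen only for illustration. The sketch assumes every reference probability is strictly positive, and it normalizes over the listed candidates only.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """Return pi(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).

    Hypothetical helper: normalizes over a finite candidate list and
    assumes every entry of ref_probs is strictly positive.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)  # normalizer over the candidate set
    return [w / z for w in weights]

# Hypothetical toy values: three candidate responses y for one prompt x.
ref_probs = [0.70, 0.29, 0.01]  # pi_ref(y|x)
rewards = [1.0, 1.0, 2.0]       # r(x, y)
policy = tilted_policy(ref_probs, rewards, beta=1.0)
print(policy)
```

Varying `rewards` and `beta` while holding `ref_probs` fixed is one way to explore how each factor in the product influences the resulting distribution.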

Updated 2025-10-08
