Short Answer

Justifying the Reference Policy

A language model's policy is being shaped using a reward function, $r(\mathbf{x}, \mathbf{y})$. Rather than being defined as proportional to the exponentiated reward alone, the policy is defined as proportional to the product of a reference policy and the exponentiated reward: $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$. Explain the primary benefit of including the $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$ term in this formulation.
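To make the two formulations concrete before answering, the sketch below normalizes both over a small candidate set. It is only an illustration: the function name `tilted_policy`, the three candidate responses, and all probabilities and rewards are hypothetical values chosen for this example, and the reference probabilities are assumed to be strictly positive.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """Return pi(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).

    Assumes every entry of ref_probs is strictly positive.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)  # normalizing constant over the candidate set
    return [w / z for w in weights]

# Hypothetical toy values for one prompt x with three candidate responses y.
ref_probs = [0.70, 0.29, 0.01]   # pi_ref(y|x)
rewards = [1.0, 1.5, 3.0]        # r(x, y)
beta = 1.0

# Formulation from the question: reference policy times exponentiated reward.
with_ref = tilted_policy(ref_probs, rewards, beta)

# Reward-only formulation: same as using a uniform reference policy.
uniform = [1.0 / len(rewards)] * len(rewards)
reward_only = tilted_policy(uniform, rewards, beta)

print("with pi_ref :", [round(p, 3) for p in with_ref])
print("reward only :", [round(p, 3) for p in reward_only])
```

Comparing the two printed distributions for different values of `beta` is one way to see what role the $\pi_{\text{ref}}$ factor plays when the reward favors a response that the reference policy considers very unlikely.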

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science