Formula

Normalization Factor for a Reward-Weighted Policy

The normalization factor, often denoted $Z(\mathbf{x})$, converts an unnormalized, reward-weighted function into a valid probability distribution. It is computed by summing (or, for continuous outputs, integrating) the product of a reference policy, $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$, and an exponentiated, scaled reward, $\exp\left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)$, over all possible outputs $\mathbf{y}$. The formula is:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp \left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)$$

Dividing the unnormalized function by this factor guarantees that the resulting distribution sums to one.
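As a minimal sketch of the sum over outputs, the Python below computes the normalization factor for a toy case with only three candidate outputs, then divides by it to obtain the normalized, reward-weighted distribution. The function names (`normalization_factor`, `reward_weighted_policy`) and the numbers for the reference probabilities, rewards, and beta are hypothetical, chosen for illustration only. In practice the output space of a language model is far too large to enumerate this way.

```python
import math

def normalization_factor(ref_probs, rewards, beta):
    """Sum of pi_ref(y|x) * exp(r(x, y) / beta) over a finite list of outputs y."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return sum(p * math.exp(r / beta) for p, r in zip(ref_probs, rewards))

def reward_weighted_policy(ref_probs, rewards, beta):
    """Divide each unnormalized term by the normalization factor."""
    z = normalization_factor(ref_probs, rewards, beta)
    return [p * math.exp(r / beta) / z for p, r in zip(ref_probs, rewards)]

# Hypothetical toy values: three candidate outputs for a single input x.
ref_probs = [0.5, 0.3, 0.2]   # reference policy probabilities (assumed)
rewards = [1.0, 0.0, -1.0]    # rewards r(x, y) (assumed)
beta = 1.0                    # scaling parameter (assumed)

z = normalization_factor(ref_probs, rewards, beta)
policy = reward_weighted_policy(ref_probs, rewards, beta)
print("Z(x) =", z)
print("normalized policy =", policy)
print("sum =", sum(policy))
```

This direct form can overflow when rewards are large relative to beta; a log-space computation would be the more robust choice in that situation.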

Updated 2026-05-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences