Formula

Reward-Weighted Probability Distribution

A reward-weighted probability distribution, denoted as π(yx)\pi^{*}(\mathbf{y}|\mathbf{x}), is a new distribution created by modifying a reference distribution, πθref(yx)\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}), based on a reward signal, r(x,y)r(\mathbf{x}, \mathbf{y}). The reference distribution's probability for each output y\mathbf{y} is scaled by an exponential factor of the reward. The entire expression is then normalized by the partition function Z(x)Z(\mathbf{x}) to ensure it sums to one and is a valid probability distribution. The formula is: π(yx)=πθref(yx)exp(1βr(x,y))Z(x)\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp \left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}

Image 0

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences