Reward-Weighted Probability Distribution
A reward-weighted probability distribution, denoted \(\pi^*\), is a new distribution created by modifying a reference distribution, \(\pi_{\text{ref}}\), based on a reward signal, \(r(\mathbf{x}, \mathbf{y})\). The reference distribution's probability for each output is scaled by an exponential factor of the reward. The entire expression is then normalized by the partition function \(Z(\mathbf{x})\) to ensure it sums to one and is a valid probability distribution. The formula is:

\[
\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp\!\left(\frac{r(\mathbf{x}, \mathbf{y})}{\beta}\right), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp\!\left(\frac{r(\mathbf{x}, \mathbf{y})}{\beta}\right)
\]
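A minimal numerical sketch of this construction (the two-outcome distribution and reward values reuse the worked example below; the function name `reward_weighted` is illustrative):

```python
import math

def reward_weighted(pi_ref, reward, beta=1.0):
    """Scale each reference probability by exp(r / beta), then renormalize."""
    weights = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    Z = sum(weights.values())  # partition function
    return {y: w / Z for y, w in weights.items()}

pi_star = reward_weighted({"y1": 0.6, "y2": 0.4}, {"y1": 2.0, "y2": 1.0})
print(pi_star)  # sums to 1; the higher-reward output y1 gains probability mass
```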
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward-Weighted Probability Distribution
Consider a scenario where for a given input \(\mathbf{x}\), there are only two possible outputs, \(\mathbf{y}_1\) and \(\mathbf{y}_2\). A reference model assigns probabilities \(\pi_{\text{ref}}(\mathbf{y}_1|\mathbf{x}) = 0.6\) and \(\pi_{\text{ref}}(\mathbf{y}_2|\mathbf{x}) = 0.4\). A reward function gives scores \(r(\mathbf{x}, \mathbf{y}_1) = 2\) and \(r(\mathbf{x}, \mathbf{y}_2) = 1\). Assuming the scaling factor \(\beta\) is 1, what is the value of the normalization factor \(Z(\mathbf{x})\), which is calculated as \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\)?
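The arithmetic for this question can be sketched directly:

```python
import math

# Z(x) = pi_ref(y1|x) * exp(r1) + pi_ref(y2|x) * exp(r2), with beta = 1
Z = 0.6 * math.exp(2.0) + 0.4 * math.exp(1.0)
print(round(Z, 3))  # 5.521
```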
Consider the calculation of a normalization factor using the formula \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\). If the reward function \(r(\mathbf{x}, \mathbf{y})\) consistently returns a value of 0 for all possible outputs \(\mathbf{y}\), the normalization factor \(Z(\mathbf{x})\) will always be equal to 1, since \(\exp(0) = 1\) and the reference probabilities sum to one.
Impact of Scaling Factor on Normalization
Learn After
In the formula for a reward-weighted probability distribution, the parameter \(\beta\) acts as a temperature or inverse scaling factor. How does decreasing the value of \(\beta\) (i.e., moving it closer to 0, but remaining positive) affect the final distribution \(\pi^*\)?
Applying a Reward Function to a Language Model's Output
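The effect of shrinking \(\beta\) in the reward-weighted formula can be checked numerically. This sketch assumes \(\beta\) enters as \(\exp(r/\beta)\), consistent with the temperature interpretation; the probabilities and rewards are illustrative:

```python
import math

def reward_weighted(pi_ref, reward, beta):
    """Reward-weighted distribution, assuming beta enters as exp(r / beta)."""
    w = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    Z = sum(w.values())
    return {y: v / Z for y, v in w.items()}

pi_ref = {"y1": 0.6, "y2": 0.4}
reward = {"y1": 2.0, "y2": 1.0}

# Large beta -> pi* stays close to pi_ref; beta -> 0+ concentrates nearly
# all probability mass on the highest-reward output.
for beta in (10.0, 1.0, 0.1):
    print(beta, reward_weighted(pi_ref, reward, beta))
```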
Target Policy as a Reward-Weighted Distribution
In the context of a reward-weighted probability distribution, defined as \(\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, \exp(r(\mathbf{x}, \mathbf{y})/\beta)\), consider a scenario where a specific output, \(\mathbf{y}'\), receives a very high reward, \(r(\mathbf{x}, \mathbf{y}')\). However, the reference distribution assigns a probability to this output that is extremely close to zero, i.e., \(\pi_{\text{ref}}(\mathbf{y}'|\mathbf{x}) \approx 0\). What will be the approximate probability of \(\mathbf{y}'\) in the final distribution, \(\pi^*(\mathbf{y}'|\mathbf{x})\)?
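A quick numerical check of this limiting case (the specific probability and reward values are illustrative, with \(\beta = 1\)):

```python
import math

# A large reward cannot rescue an output the reference model
# considers essentially impossible: the product stays near zero.
pi_ref = {"y_prime": 1e-12, "other": 1.0 - 1e-12}
reward = {"y_prime": 10.0, "other": 0.0}
w = {y: p * math.exp(reward[y]) for y, p in pi_ref.items()}  # beta = 1
Z = sum(w.values())
p_y_prime = w["y_prime"] / Z
print(p_y_prime)  # roughly 2.2e-8 -- still approximately zero
```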