1Cademy - Applying a Reward Function to a Language Model's Output

Learn Before

Reward-Weighted Probability Distribution

Case Study

Applying a Reward Function to a Language Model's Output

Using the formula for a reward-weighted probability distribution, determine which completion becomes the most likely after applying the reward function. Explain your steps, including the calculation of the final probability for each completion.
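The calculation the prompt asks for can be sketched in a few lines of Python. The formula used is the reward-weighted distribution $\pi^{*}(\mathbf{y}|\mathbf{x}) = \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right) / Z(\mathbf{x})$, where $Z(\mathbf{x})$ is the sum of the unnormalized weights over all completions. The three completions, their reference probabilities, their rewards, and the value of β below are hypothetical placeholders chosen only for illustration; they are not the case study's data, and the function name `reward_weighted` is likewise an assumption.

```python
import math

def reward_weighted(ref_probs, rewards, beta):
    """Return pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x) for each completion."""
    # Step 1: unnormalized weight for each completion.
    weights = {y: ref_probs[y] * math.exp(rewards[y] / beta) for y in ref_probs}
    # Step 2: the normalizer Z(x) is the sum of those weights.
    z = sum(weights.values())
    # Step 3: divide by Z(x) so the result sums to 1.
    return {y: w / z for y, w in weights.items()}

# Hypothetical inputs (NOT from the case study): the reference model prefers A,
# while the reward function prefers C.
ref_probs = {"A": 0.6, "B": 0.3, "C": 0.1}
rewards = {"A": 0.0, "B": 1.0, "C": 2.0}
beta = 1.0

pi_star = reward_weighted(ref_probs, rewards, beta)
best = max(pi_star, key=pi_star.get)

for y, p in pi_star.items():
    print(f"{y}: {p:.4f}")
print("most likely completion:", best)
```

With these placeholder numbers the final probabilities come out to roughly 0.28 for A, 0.38 for B, and 0.34 for C, so B becomes the most likely completion: neither the reference model's favorite nor the highest-reward output wins outright, because the result depends on the product of both factors.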

Updated 2025-10-04

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

In the formula for a reward-weighted probability distribution, $\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp \left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}$, the parameter $\beta$ acts as a temperature or inverse scaling factor. How does decreasing the value of $\beta$ (i.e., moving it closer to 0 while keeping it positive) affect the final distribution $\pi^{*}$?
Target Policy as a Reward-Weighted Distribution
In the context of a reward-weighted probability distribution, defined as $\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp \left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}$, consider a scenario where a specific output, $\mathbf{y}_A$, receives a very high reward, $r(\mathbf{x}, \mathbf{y}_A)$. However, the reference distribution assigns a probability to this output that is extremely close to zero, i.e., $\pi_{\theta_{\text{ref}
Effect of Temperature Parameter on Reward-Weighted Distributions
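The role of β in the related questions above can be explored numerically. The following is a minimal sketch that evaluates $\pi^{*}(\mathbf{y}|\mathbf{x}) \propto \pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)$ for several values of β. The reference probabilities, rewards, and β values are hypothetical, and the helper name `reward_weighted` is an assumption made for illustration.

```python
import math

def reward_weighted(ref_probs, rewards, beta):
    """Normalize pi_ref(y|x) * exp(r(x, y) / beta) over a list of completions."""
    scaled = [r / beta for r in rewards]
    # Subtracting the maximum exponent avoids overflow for small beta;
    # the shift cancels in the normalization, so the result is unchanged.
    m = max(scaled)
    weights = [p * math.exp(s - m) for p, s in zip(ref_probs, scaled)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical inputs: the reference model prefers the first completion,
# the reward function prefers the last one.
ref_probs = [0.6, 0.3, 0.1]
rewards = [0.0, 1.0, 2.0]

results = {}
for beta in (10.0, 1.0, 0.1):
    results[beta] = reward_weighted(ref_probs, rewards, beta)
    print(f"beta={beta}: {[round(p, 4) for p in results[beta]]}")
```

In this sketch, a large β leaves the distribution close to the reference probabilities, while a small positive β moves almost all of the mass onto the highest-reward completion. Because each weight is multiplied by the reference probability, a completion whose reference probability is exactly zero stays at zero for every β, and one with a near-zero reference probability needs a correspondingly large value of $r/\beta$ before it gains appreciable mass.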
