Learn Before

Target Policy as a Reward-Weighted Distribution

Multiple Choice

A language model's behavior is guided by a target probability distribution, $\pi^{*}$, which is defined by re-weighting a reference distribution, $\pi_{\text{ref}}$, according to a reward score, $r(\mathbf{x}, \mathbf{y})$. The relationship is given by the formula:

$$ \pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp \left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})} $$

In this formula, $\beta$ is a positive scalar parameter and $Z(\mathbf{x})$ is the normalization factor that makes $\pi^{*}(\cdot|\mathbf{x})$ sum to one. Analyze the effect of significantly increasing the value of $\beta$. Wh
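The formula can be explored numerically. Below is a minimal sketch, assuming a single fixed prompt with only three candidate outputs; the function name `target_policy`, the reference probabilities, and the reward values are all hypothetical and chosen purely for illustration.

```python
import math

def target_policy(ref_probs, rewards, beta):
    """Compute pi*(y|x) over a finite candidate set for one fixed prompt x.

    ref_probs: reference probabilities pi_ref(y|x), one per candidate y
    rewards:   reward scores r(x, y), one per candidate y
    beta:      positive scalar from the formula
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in rewards]
    # Subtract the largest scaled reward before exponentiating to avoid
    # overflow; the common factor cancels when dividing by Z(x).
    shift = max(scaled)
    weights = [p * math.exp(s - shift) for p, s in zip(ref_probs, scaled)]
    z = sum(weights)  # plays the role of Z(x), up to the cancelled factor
    return [w / z for w in weights]

# Hypothetical toy values (not taken from the source).
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (0.1, 1.0, 100.0):
    probs = target_policy(ref_probs, rewards, beta)
    print(f"beta={beta}: {[round(p, 3) for p in probs]}")
```

Running the loop with several values of `beta` shows how the reward term's influence on the resulting distribution changes relative to the reference distribution.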

Updated 2025-10-07
