1Cademy - Consider a scenario where for a given input \(\mathbf{x}\), there are only two possible outputs, \(\mathbf{y}_1\) and \(\mathbf{y}_2\). A reference model assigns probabilities \(\pi_{\text{ref}}(\mathbf{y}_1|\mathbf{x}) = 0.6\) and \(\pi_{\text{ref}}(\mathbf{y}_2|\mathbf{x}) = 0.4\). A reward function gives scores \(r(\mathbf{x}, \mathbf{y}_1) = 2\) and \(r(\mathbf{x}, \mathbf{y}_2) = 1\). Assuming the scaling factor \(\beta\) is 1, what is the value of the normalization factor \(Z(\mathbf{x})\), which is calculated as \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\)?

Learn Before

Normalization Factor for a Reward-Weighted Policy

Multiple Choice

Consider a scenario where for a given input \(\mathbf{x}\), there are only two possible outputs, \(\mathbf{y}_1\) and \(\mathbf{y}_2\). A reference model assigns probabilities \(\pi_{\text{ref}}(\mathbf{y}_1|\mathbf{x}) = 0.6\) and \(\pi_{\text{ref}}(\mathbf{y}_2|\mathbf{x}) = 0.4\). A reward function gives scores \(r(\mathbf{x}, \mathbf{y}_1) = 2\) and \(r(\mathbf{x}, \mathbf{y}_2) = 1\). Assuming the scaling factor \(\beta\) is 1, what is the value of the normalization factor \(Z(\mathbf{x})\), which is calculated as \(Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x}, \mathbf{y}))\)?
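To check the arithmetic, the sum can be evaluated directly. The following is a minimal sketch, not part of the original question: the function name `normalization_factor` and the dictionary layout are hypothetical, and dividing the reward by \(\beta\) is an assumed convention that makes no difference here because \(\beta = 1\).

```python
import math


def normalization_factor(ref_probs, rewards, beta=1.0):
    """Return Z(x) = sum over y of pi_ref(y|x) * exp(r(x, y) / beta).

    The 1/beta scaling is an assumption; with beta = 1 it reduces to the
    formula stated in the question.
    """
    return sum(ref_probs[y] * math.exp(rewards[y] / beta) for y in ref_probs)


# Values taken from the question.
ref_probs = {"y1": 0.6, "y2": 0.4}  # pi_ref(y | x)
rewards = {"y1": 2.0, "y2": 1.0}    # r(x, y)

# Z(x) = 0.6 * e^2 + 0.4 * e^1
z = normalization_factor(ref_probs, rewards, beta=1.0)
print(f"Z(x) = {z:.4f}")  # approximately 5.5207
```

Term by term, \(0.6 \cdot e^{2} \approx 4.4334\) and \(0.4 \cdot e^{1} \approx 1.0873\), so \(Z(\mathbf{x}) \approx 5.52\).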

Updated 2025-09-29
