Multiple Choice

In a framework for aligning language models, a reward function is defined as:

$$r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$$

where $\pi_{\theta}$ is the target policy, $\pi_{\theta_{\text{ref}}}$ is a reference policy, $\beta$ is a scaling factor, and $Z(\mathbf{x})$ is a normalization factor dependent on the prompt $\mathbf{x}$. Given two distinct responses, $\mathbf{y}_a$ and $\mathbf{y}_b$, to the same prompt $\mathbf{x}$, which expression correctly represents the difference in their rewards, $r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)$?

0

1
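A quick way to reason about the difference is to note that $\beta \log Z(\mathbf{x})$ appears in both rewards and cancels on subtraction, leaving $\beta$ times the difference of the two log-ratios. The sketch below checks this numerically; the log-probabilities, $\beta$, and $\log Z(\mathbf{x})$ values are arbitrary placeholders chosen for illustration, not values from the text.

```python
beta = 0.1    # assumed scaling factor (placeholder value)
log_Z = 2.5   # arbitrary prompt-dependent normalizer log Z(x)

# Hypothetical log-probabilities of two responses y_a, y_b under the
# target policy pi_theta and the reference policy pi_theta_ref.
logp_theta = {"y_a": -3.0, "y_b": -5.0}
logp_ref   = {"y_a": -4.0, "y_b": -4.5}

def reward(y, log_Z):
    """r(x, y) = beta * (log pi_theta(y|x)/pi_ref(y|x) + log Z(x))."""
    return beta * (logp_theta[y] - logp_ref[y] + log_Z)

# The log Z(x) term is identical in both rewards, so it cancels:
diff = reward("y_a", log_Z) - reward("y_b", log_Z)
expected = beta * ((logp_theta["y_a"] - logp_ref["y_a"])
                   - (logp_theta["y_b"] - logp_ref["y_b"]))
print(abs(diff - expected) < 1e-12)

# Changing log Z(x) leaves the difference untouched, confirming that
# r(x, y_a) - r(x, y_b) depends only on the two log-ratios.
print(abs((reward("y_a", 99.0) - reward("y_b", 99.0)) - diff) < 1e-12)
```

Both checks print `True` regardless of the value chosen for `log_Z`, which is why the normalization factor drops out of pairwise reward comparisons.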

Updated 2025-10-04


Tags

Ch.4 Alignment - Foundations of Large Language Models

