1Cademy - In a particular policy optimization framework, the target policy, denoted as $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, is determined by the following relationship involving a reference policy $\pi_{\theta_{\text{ref}}}$, a reward function $r(\mathbf{x}, \mathbf{y})$, a positive temperature parameter $\beta$, and a normalization term $Z(\mathbf{x})$: $$ \pi_{\theta}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y}))}{Z(\mathbf{x})} $$ Given this formula, what is the primary effect of significantly increasing the reward $r(\mathbf{x}, \mathbf{y})$ for a single, specific output $\mathbf{y}$, while keeping all other factors constant?

Learn Before

Solution to KL Divergence Minimization for Policy Optimization

Multiple Choice

In a particular policy optimization framework, the target policy, denoted as $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, is determined by the following relationship involving a reference policy $\pi_{\theta_{\text{ref}}}$, a reward function $r(\mathbf{x}, \mathbf{y})$, a positive temperature parameter $\beta$, and a normalization term $Z(\mathbf{x})$:

$$ \pi_{\theta}(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta}r(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})} $$

Given this formula, what is the primary effect of significantly increasing the reward $r(\mathbf{x}, \mathbf{y})$ for a single, specific output $\mathbf{y}$, while keeping all other factors constant?
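One way to build intuition for the formula is to evaluate it numerically on a toy example. The sketch below assumes a finite set of three candidate outputs, with made-up reference probabilities, rewards, and $\beta = 1$ (none of these values come from the question). It also assumes that $Z(\mathbf{x})$ is the sum of the unnormalized weights, so the result is a valid probability distribution. The function name `target_policy` is hypothetical.

```python
import math

def target_policy(ref_probs, rewards, beta):
    """Evaluate pi(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
    over a finite list of candidate outputs (an illustrative assumption)."""
    # Subtracting the max reward is only for numerical stability;
    # the constant factor cancels when dividing by Z(x).
    m = max(rewards)
    weights = [p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # normalization term Z(x)
    return [w / z for w in weights]

# Hypothetical toy values: three candidate outputs for one input x.
ref_probs = [0.5, 0.3, 0.2]
beta = 1.0

# Equal rewards: the target policy matches the reference policy.
before = target_policy(ref_probs, [1.0, 1.0, 1.0], beta)

# Raise the reward of the last output only.
after = target_policy(ref_probs, [1.0, 1.0, 5.0], beta)

print("before:", before)
print("after: ", after)
```

In this toy setting, the boosted output's probability rises, the other outputs lose probability mass, and the ratio between the untouched outputs stays the same as under the reference policy, because each of them is rescaled by the same factor through $Z(\mathbf{x})$.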

Updated 2025-09-28
