1Cademy - In a policy-based language model alignment process, the reward `r(x, y)` for a response `y` to a prompt `x` is defined by the equation: $$ r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right) $$ where `π_θ` is the target policy, `π_θ_ref` is the reference policy, `β` is a positive scaling factor, and `Z(x)` is a normalization factor. If, for a specific response `y_1`, the target policy assigns a lower probability than the reference policy (i.e., `π_θ(y_1|x) < π_θ_ref(y_1|x)`), what is the direct consequence for the log-ratio component of the reward calculation?

Learn Before

Reward Function in Terms of Policy Models and Normalization Factor

Multiple Choice

In a policy-based language model alignment process, the reward r(x, y) for a response y to a prompt x is defined by the equation: $r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$ where π_θ is the target policy, π_θ_ref is the reference policy, β is a positive scaling factor, and Z(x) is a normalization factor. If, for a specific response y_1, the target policy assigns a lower probability than the reference policy (i.e., π_θ(y_1|x) < π_θ_ref(y_1|x)), what is the direct consequence for the log-ratio component of the reward calculation?

0

1

Updated 2025-09-28

Contributors are:

Who are from:

Learn Before

Related