Multiple Choice

In a policy-based language model alignment process, the reward $r(\mathbf{x}, \mathbf{y})$ for a response $\mathbf{y}$ to a prompt $\mathbf{x}$ is defined by the equation

$$
r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)
$$

where $\pi_{\theta}$ is the target policy, $\pi_{\theta_{\text{ref}}}$ is the reference policy, $\beta$ is a positive scaling factor, and $Z(\mathbf{x})$ is a normalization factor. If, for a specific response $\mathbf{y}_1$, the target policy assigns a lower probability than the reference policy (i.e., $\pi_{\theta}(\mathbf{y}_1|\mathbf{x}) < \pi_{\theta_{\text{ref}}}(\mathbf{y}_1|\mathbf{x})$), what is the direct consequence for the log-ratio component of the reward calculation?
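The sign of the log-ratio follows directly from the definition above (a short derivation using only the quantities stated in the question):

$$
\pi_{\theta}(\mathbf{y}_1|\mathbf{x}) < \pi_{\theta_{\text{ref}}}(\mathbf{y}_1|\mathbf{x})
\;\Longrightarrow\;
\frac{\pi_{\theta}(\mathbf{y}_1|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_1|\mathbf{x})} < 1
\;\Longrightarrow\;
\log \frac{\pi_{\theta}(\mathbf{y}_1|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}_1|\mathbf{x})} < 0.
$$

Since $\beta$ is positive, this negative log-ratio lowers the reward $r(\mathbf{x}, \mathbf{y}_1)$ relative to the $\beta \log Z(\mathbf{x})$ term.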

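To make the effect concrete, here is a minimal Python sketch; the value of $\beta$ and the two probabilities below are hypothetical, chosen only so that $\pi_{\theta}(\mathbf{y}_1|\mathbf{x}) < \pi_{\theta_{\text{ref}}}(\mathbf{y}_1|\mathbf{x})$ holds:

```python
import math

# Hypothetical values (not from the question): beta is the positive
# scaling factor; p_target < p_ref mirrors pi_theta(y1|x) < pi_theta_ref(y1|x).
beta = 0.1
p_target = 0.02  # pi_theta(y1 | x): target policy probability
p_ref = 0.05     # pi_theta_ref(y1 | x): reference policy probability

# The log-ratio is negative whenever p_target < p_ref.
log_ratio = math.log(p_target / p_ref)
print(f"log-ratio: {log_ratio:.4f}")                   # -0.9163
print(f"scaled contribution: {beta * log_ratio:.4f}")  # -0.0916
```

Any pair of probabilities with p_target < p_ref produces the same qualitative result: a negative log-ratio, scaled by $\beta$ into a negative contribution to the reward.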


Tags

Ch.4 Alignment - Foundations of Large Language Models
