Multiple Choice

In a framework for aligning language models, a reward function is defined as:

$$r(\mathbf{x}, \mathbf{y}) = \beta \left( \log \frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})} + \log Z(\mathbf{x}) \right)$$

where $\pi_{\theta}$ is the target policy, $\pi_{\theta_{\text{ref}}}$ is a reference policy, $\beta$ is a scaling factor, and $Z(\mathbf{x})$ is a normalization factor dependent on the prompt $\mathbf{x}$. Given two distinct responses, $\mathbf{y}_a$ and $\mathbf{y}_b$, to the same prompt $\mathbf{x}$, which expression correctly represents the difference in their rewards, $r(\mathbf{x}, \mathbf{y}_a) - r(\mathbf{x}, \mathbf{y}_b)$?

0

1
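A quick way to reason about the difference is to note that $\beta \log Z(\mathbf{x})$ appears in both rewards and cancels on subtraction, leaving $\beta$ times the difference of the two log-ratios. The sketch below checks this numerically; the log-probabilities, $\beta$, and $\log Z(\mathbf{x})$ values are arbitrary placeholders chosen for illustration, not values from the text.

```python
beta = 0.1    # assumed scaling factor (placeholder value)
log_Z = 2.5   # arbitrary prompt-dependent normalizer log Z(x)

# Hypothetical log-probabilities of two responses y_a, y_b under the
# target policy pi_theta and the reference policy pi_theta_ref.
logp_theta = {"y_a": -3.0, "y_b": -5.0}
logp_ref   = {"y_a": -4.0, "y_b": -4.5}

def reward(y, log_Z):
    """r(x, y) = beta * (log pi_theta(y|x)/pi_ref(y|x) + log Z(x))."""
    return beta * (logp_theta[y] - logp_ref[y] + log_Z)

# The log Z(x) term is identical in both rewards, so it cancels:
diff = reward("y_a", log_Z) - reward("y_b", log_Z)
expected = beta * ((logp_theta["y_a"] - logp_ref["y_a"])
                   - (logp_theta["y_b"] - logp_ref["y_b"]))
print(abs(diff - expected) < 1e-12)

# Changing log Z(x) leaves the difference untouched, confirming that
# r(x, y_a) - r(x, y_b) depends only on the two log-ratios.
print(abs((reward("y_a", 99.0) - reward("y_b", 99.0)) - diff) < 1e-12)
```

Both checks print `True` regardless of the value chosen for `log_Z`, which is why the normalization factor drops out of pairwise reward comparisons.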

Updated 2025-10-04


Tags

Ch.4 Alignment - Foundations of Large Language Models

