Multiple Choice

A reward model is being trained with a loss function that includes a regularization term to keep its output scores from growing excessively large. For a single pair of responses $(\mathbf{y}_a, \mathbf{y}_b)$ to an input $\mathbf{x}$, the regularization component is calculated as $(r(\mathbf{x}, \mathbf{y}_a) + r(\mathbf{x}, \mathbf{y}_b))^2$, where $r$ is the reward score. A higher value for this term results in a larger penalty. Given the following four pairs of reward scores, which pair would incur the largest penalty from this specific regularization term?
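To make the term concrete, here is a minimal sketch that evaluates the squared sum for a few score pairs. The function name `regularization_penalty` and the pairs in `example_pairs` are assumptions chosen for illustration only; they are not the answer options of this question.

```python
def regularization_penalty(r_a: float, r_b: float) -> float:
    """Squared sum of the two reward scores for one response pair."""
    return (r_a + r_b) ** 2


# Hypothetical score pairs (illustrative only, not the quiz's options).
example_pairs = [(5.0, -5.0), (1.0, 2.0), (-3.0, -2.0), (0.5, 0.5)]

# Map each pair to its penalty, then pick the pair with the largest one.
penalties = {pair: regularization_penalty(*pair) for pair in example_pairs}
largest = max(penalties, key=penalties.get)

for pair, penalty in penalties.items():
    print(pair, penalty)
print("largest penalty:", largest)
```

In this sketch the pair with the largest individual scores, (5.0, -5.0), gets a penalty of zero because the scores cancel. The term depends only on the magnitude of the sum, so two moderately sized scores with the same sign can be penalized more than two large scores with opposite signs.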

Updated 2025-09-29

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
