
Learn Before

Regularized Pairwise Loss Function for Reward Model Training

Multiple Choice

A reward model is being trained using a loss function that includes a regularization term to prevent its output scores from growing excessively large. The regularization component for a single pair of responses, $(\mathbf{y}_a, \mathbf{y}_b)$, to an input, $\mathbf{x}$, is calculated as $(r(\mathbf{x}, \mathbf{y}_a) + r(\mathbf{x}, \mathbf{y}_b))^2$, where $r$ is the reward score. A higher value for this term results in a larger penalty. Given the following four pairs of reward scores, which pai
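As a minimal sketch of how the regularization term above behaves, the snippet below evaluates $(r_a + r_b)^2$ for a few score pairs and finds the one with the largest penalty. The four pairs are hypothetical values chosen for illustration, not the answer options of this question, and the function name `regularization_penalty` is an assumption.

```python
def regularization_penalty(r_a: float, r_b: float) -> float:
    """Squared sum of the two reward scores for one response pair."""
    return (r_a + r_b) ** 2


# Hypothetical (r_a, r_b) score pairs, for illustration only.
pairs = [(2.0, -2.0), (3.0, 1.0), (-4.0, -1.0), (0.5, 0.5)]

penalties = [regularization_penalty(r_a, r_b) for r_a, r_b in pairs]

# Index of the pair that incurs the largest penalty.
largest = max(range(len(pairs)), key=lambda i: penalties[i])

for (r_a, r_b), penalty in zip(pairs, penalties):
    print(f"r_a={r_a:+.1f}, r_b={r_b:+.1f} -> penalty={penalty:.1f}")
print("largest penalty at index", largest)
```

Because the term squares the sum of the two scores, the sign of the sum does not matter, and scores of opposite sign partly or fully cancel: in this sketch the pair $(2, -2)$ incurs no penalty, while $(-4, -1)$ incurs the largest.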

Updated 2025-09-29
