1Cademy

Learn Before

Empirical Reward Model Loss Formula

Multiple Choice

A researcher is training a reward model using a small preference dataset, $\mathcal{D}_r$, which contains exactly two preference pairs:

1. For input $\mathbf{x}_1$, response $\mathbf{y}_{1a}$ is preferred over $\mathbf{y}_{1b}$.
2. For input $\mathbf{x}_2$, response $\mathbf{y}_{2a}$ is preferred over $\mathbf{y}_{2b}$.

Given the empirical loss formula $\mathcal{L}_r(\phi) = -\frac{1}{|\mathcal{D}_r|} \sum_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\in\mathcal{D}_r} \log \text{Pr}_{\phi}(\mathbf{y
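As a rough illustration of how the sum expands for a dataset with $|\mathcal{D}_r| = 2$, the sketch below averages the negative log-probabilities of the two preferred responses. The formula is cut off in the text above, so the sketch assumes the common Bradley-Terry form, in which the preference probability is the sigmoid of the difference between the two reward scores; that choice is an assumption, not something the source states. The function name `pairwise_loss` and the reward values are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, used here as the assumed preference model."""
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average negative log-likelihood over preference pairs.

    Each pair is (reward of preferred response, reward of rejected response).
    Assumption: Pr(y_a preferred over y_b | x) = sigmoid(r_a - r_b).
    """
    total = 0.0
    for r_a, r_b in pairs:
        total += math.log(sigmoid(r_a - r_b))
    # The leading minus sign and the 1/|D_r| factor from the loss formula.
    return -total / len(pairs)


# Hypothetical reward scores for the two pairs in D_r:
# (r(x1, y1a), r(x1, y1b)) and (r(x2, y2a), r(x2, y2b)).
pairs = [(1.2, 0.3), (0.1, 0.4)]
loss = pairwise_loss(pairs)
print(f"loss over {len(pairs)} pairs: {loss:.4f}")
```

With two pairs, the loss is one half of the sum of two log terms, negated. If the model scored both responses in every pair equally, each probability would be 0.5 and the loss would equal $\log 2$.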

Updated 2025-10-07
