Multiple Choice

A reward model is trained to learn human preferences by minimizing the following loss function, an expectation over a preference dataset $\mathcal{D}$:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\sim\mathcal{D}}\left[\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})\right]$$

In this dataset, $\mathbf{y}_a$ denotes a response preferred over response $\mathbf{y}_b$ for a given input $\mathbf{x}$. What is the primary effect of successfully minimizing this loss function on the model's behavior?
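The question does not say how $\text{Pr}_{\phi}$ is parameterized; the standard choice in reward modeling is the Bradley-Terry model, $\text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) = \sigma\big(r_{\phi}(\mathbf{x},\mathbf{y}_a) - r_{\phi}(\mathbf{x},\mathbf{y}_b)\big)$, where $r_{\phi}$ is a scalar reward. Below is a minimal PyTorch sketch of the resulting pairwise loss under that assumption; the function name `preference_loss` and the toy reward values are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: Pr(y_a > y_b | x) = sigmoid(r_a - r_b), so the
    # negative log-likelihood of the preferred response is
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage (illustrative values): scalar rewards for 3 preference pairs.
r_a = torch.tensor([1.2, 0.3, 2.0])  # r_phi(x, y_a) for preferred responses
r_b = torch.tensor([0.5, 0.9, 1.1])  # r_phi(x, y_b) for rejected responses
print(preference_loss(r_a, r_b))     # loss shrinks as the margin r_a - r_b grows
```

Minimizing this quantity pushes the reward margin between preferred and rejected responses upward, which is exactly what the loss in the question rewards.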




Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy