Learn Before

Empirical Pair-wise Ranking Loss for RLHF Reward Model

Multiple Choice

A model is being trained to evaluate text completions for a given prompt. The training data consists of pairs of completions for each prompt, where one is marked as 'preferred' ($y_a$) and the other as 'dispreferred' ($y_b$) by human reviewers. The model learns by minimizing the following loss function, averaged over all pairs in the dataset:

$\mathcal{L} = -\log \sigma\big(\operatorname{score}(y_a) - \operatorname{score}(y_b)\big)$

where $\sigma$ is the sigmoid function and $\operatorname{score}(y)$ is the value the model assigns to a completion ...
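
To make the loss concrete, here is a minimal Python sketch of the formula above. The function names `pairwise_loss` and `mean_pairwise_loss` are illustrative assumptions, not part of the original question, and the scores are plain floats standing in for whatever the model would output. The sketch uses the identity $-\log \sigma(d) = \log(1 + e^{-d})$ so that large score gaps do not overflow.

```python
import math

def pairwise_loss(score_a: float, score_b: float) -> float:
    """-log(sigmoid(score_a - score_b)) for one (preferred, dispreferred) pair.

    Hypothetical helper: score_a is the score of the preferred completion y_a,
    score_b the score of the dispreferred completion y_b.
    """
    d = score_a - score_b
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def mean_pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the per-pair loss over a dataset of (score_a, score_b) pairs."""
    return sum(pairwise_loss(a, b) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Equal scores: the model has no preference, so the loss is log(2).
    print(pairwise_loss(0.0, 0.0))
    # Preferred completion scored higher: smaller loss.
    print(pairwise_loss(2.0, 0.0))
    # Preferred completion scored lower: larger loss.
    print(pairwise_loss(0.0, 2.0))
```

With equal scores the loss is $\log 2 \approx 0.693$. It shrinks toward 0 as $\operatorname{score}(y_a)$ rises above $\operatorname{score}(y_b)$ and grows roughly linearly in the gap when the ordering is reversed. Only the difference between the two scores enters the loss.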

Updated 2025-09-26
