Calculating Pair-wise Ranking Loss
A reward model is being trained on a dataset of human preferences. For one specific data point in the dataset, the model is given a prompt ($x$), a human-preferred response ($y_w$), and a human-dispreferred response ($y_l$). The model assigns the following scalar scores:
- Score for preferred response, $r(x, y_w)$
- Score for dispreferred response, $r(x, y_l)$
The loss for this single data point is calculated using the formula:
$$\mathcal{L} = -\log \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
Where $\sigma$ is the sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, and $\log$ is the natural logarithm.
Calculate the loss value for this specific data point. Explain the significance of the resulting loss value in the context of training this model. (You may use the approximation ).
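A minimal Python sketch of this kind of calculation, where the loss is the negative natural log of the sigmoid of the score difference (preferred minus dispreferred). The scores used here are hypothetical placeholders, not the values from this exercise, and the function names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_ranking_loss(score_preferred: float, score_dispreferred: float) -> float:
    """Negative log-sigmoid of the score margin (preferred minus dispreferred)."""
    margin = score_preferred - score_dispreferred
    return -math.log(sigmoid(margin))


# Hypothetical scores, chosen only to illustrate the three regimes.
print(round(pairwise_ranking_loss(2.0, 1.0), 4))  # correct order, margin +1 -> 0.3133
print(round(pairwise_ranking_loss(1.0, 1.0), 4))  # tie, margin 0 -> 0.6931 (ln 2)
print(round(pairwise_ranking_loss(1.0, 2.0), 4))  # wrong order, margin -1 -> 1.3133
```

A loss below ln 2 (about 0.693) means the model already scores the preferred response higher. A loss above ln 2 means the ranking is inverted, so this data point contributes a larger gradient during training.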
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Related
A model is being trained to evaluate text completions for a given prompt. The training data consists of pairs of completions for each prompt, where one is marked as 'preferred' ($y_w$) and the other as 'dispreferred' ($y_l$) by human reviewers. The model learns by minimizing the following loss function, averaged over all pairs in the dataset:
$$\mathcal{L} = -\log \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
where $\sigma$ is the sigmoid function and $r(x, y)$ is the value the model assigns to a completion $y$ of prompt $x$.
What is the primary effect of minimizing this loss function on the scores the model assigns?
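One way to see the effect is to run plain gradient descent directly on two scalar scores, treating them as free parameters. This is a toy sketch under that assumption, not a real reward model: the starting scores, learning rate, and function names are all hypothetical. It uses the fact that the derivative of $-\log \sigma(d)$ with respect to the margin $d$ is $-\sigma(-d)$.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grads(r_pref: float, r_disp: float):
    """Return the pairwise loss and its gradients w.r.t. both scores."""
    margin = r_pref - r_disp
    loss = -math.log(sigmoid(margin))
    g = sigmoid(-margin)  # equals 1 - sigmoid(margin)
    return loss, -g, g    # dL/dr_pref, dL/dr_disp


def descend(r_pref: float, r_disp: float, lr: float = 0.5, steps: int = 20):
    """Gradient descent on the two scores; records (margin, loss) before each step."""
    history = []
    for _ in range(steps):
        loss, g_pref, g_disp = loss_and_grads(r_pref, r_disp)
        history.append((r_pref - r_disp, loss))
        r_pref -= lr * g_pref  # pushes the preferred score up
        r_disp -= lr * g_disp  # pushes the dispreferred score down
    return r_pref, r_disp, history


# Hypothetical start: the pair is initially ranked the wrong way round.
final_pref, final_disp, history = descend(r_pref=0.0, r_disp=1.0)
for margin, loss in history[::5]:
    print(f"margin={margin:+.3f}  loss={loss:.3f}")
```

In this sketch the margin grows at every step and the loss shrinks, so minimizing the loss widens the gap between the preferred and dispreferred scores. Only the difference matters: the two updates are equal and opposite, so the absolute level of the scores is not pinned down by this loss.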
Analyzing Reward Model Loss Behavior