Empirical Pair-wise Ranking Loss for RLHF Reward Model
The reward model in RLHF is trained by minimizing an empirical pair-wise ranking loss, calculated as an average over the human preference dataset. This loss function encourages the model to assign a higher score to the preferred response ($y_w$) than to the less preferred one ($y_l$) for the same input prompt. The formula, which is based on the Bradley-Terry model, is:

$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x,\, y_w,\, y_l) \in \mathcal{D}} \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

Here, $\mathcal{D}$ is the preference dataset, $r_\theta$ is the reward model with parameters $\theta$, and $\sigma$ is the sigmoid function.
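A minimal NumPy sketch of this computation (the function name and the scores below are illustrative, not from the source). It uses the identity $-\log\sigma(m) = \log(1 + e^{-m})$, which is the numerically stable way to evaluate the per-pair term:

```python
import numpy as np

def pairwise_ranking_loss(rewards_preferred, rewards_rejected):
    """Average of -log sigmoid(r(x, y_w) - r(x, y_l)) over preference pairs."""
    margin = np.asarray(rewards_preferred) - np.asarray(rewards_rejected)
    # -log(sigmoid(m)) == log(1 + exp(-m)); logaddexp avoids overflow for large |m|
    return np.mean(np.logaddexp(0.0, -margin))

# Hypothetical reward scores for three preference pairs
loss = pairwise_ranking_loss([3.0, 0.5, 1.2], [1.0, 0.4, 2.0])
print(loss)  # wider positive margins drive the loss toward 0
```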
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Empirical Formulation of Pair-wise Ranking Loss
Empirical Pair-wise Ranking Loss for RLHF Reward Model
Regularized Pairwise Loss Function for Reward Model Training
A reward model is being trained to prefer one machine-generated text response over another for a given input. The training process aims to minimize a loss function calculated as the negative logarithm of a sigmoid applied to the difference between the reward scores of the preferred ($y_w$) and non-preferred ($y_l$) responses. Given the following reward scores assigned by the model to a single pair of responses, which scenario contributes the least to the total loss, indicating the model is correctly differentiating between the responses? (A worked comparison appears after this list.)
Diagnosing Reward Model Training Issues
Analyzing Reward Model Performance via Loss Function
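As a worked illustration of the scenario question above (the score pairs are hypothetical), the per-pair loss $-\log\sigma\big(r(y_w) - r(y_l)\big)$ shrinks as the margin grows:

$$\begin{aligned}
r(y_w) = 3,\; r(y_l) = 1:&\quad -\log\sigma(2) \approx 0.127\\
r(y_w) = 2,\; r(y_l) = 2:&\quad -\log\sigma(0) \approx 0.693\\
r(y_w) = 1,\; r(y_l) = 3:&\quad -\log\sigma(-2) \approx 2.127
\end{aligned}$$

The pair with the largest positive margin contributes the least to the total loss.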
Learn After
A model is being trained to evaluate text completions for a given prompt. The training data consists of pairs of completions for each prompt, where one is marked as 'preferred' ($y_w$) and the other as 'dispreferred' ($y_l$) by human reviewers. The model learns by minimizing the following loss function, averaged over all pairs in the dataset:

$$L = -\frac{1}{|\mathcal{D}|} \sum_{(y_w,\, y_l) \in \mathcal{D}} \log \sigma\big(r(y_w) - r(y_l)\big)$$

where $\sigma$ is the sigmoid function and $r(y)$ is the value the model assigns to a completion $y$.
What is the primary effect of minimizing this loss function on the scores the model assigns? (A sketch of the margin behavior appears at the end of this note.)
Calculating Pair-wise Ranking Loss
Analyzing Reward Model Loss Behavior
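A sketch of the margin behavior behind the question above (standard calculus on the Bradley-Terry loss, not text from the source). Writing $\Delta = r(y_w) - r(y_l)$, the per-pair loss is $\ell(\Delta) = -\log\sigma(\Delta)$, and

$$\frac{d\ell}{d\Delta} = -\big(1 - \sigma(\Delta)\big) < 0,$$

so the loss strictly decreases as the margin grows: minimizing it pushes the score of the preferred completion above that of the dispreferred one. Note that the loss depends only on the difference of the two scores, not on their absolute values.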