Learn Before
Probability-Based Supervision Signals for Reward Models
Beyond discrete preference labels, the probabilities the language model assigns to each label can serve as pointwise supervision signals for training a reward model. To obtain them, the logits (or probabilities) of the specific label tokens (such as "A" and "B") are extracted from the language model's output, and these values are renormalized into a proper probability distribution over the labels, for example by applying a softmax over just the label tokens. The resulting soft probabilities capture how confident the judging model is in its preference, not merely which output it prefers.
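A minimal sketch of this idea in PyTorch is shown below. The token ids, the helper names (`label_probabilities`, `pointwise_loss`), and the choice of binary cross-entropy as the pointwise objective are all illustrative assumptions, not a reference implementation from the course.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary ids for the label tokens "A" and "B" (hypothetical).
TOKEN_A_ID = 32
TOKEN_B_ID = 33

def label_probabilities(logits: torch.Tensor) -> torch.Tensor:
    """Extract the logits of the label tokens "A" and "B" at the position
    where the judging model emits its label, then renormalize them into a
    two-way probability distribution with softmax.

    logits: shape (batch, vocab_size), the judge LM's output logits.
    Returns: shape (batch, 2), probabilities for [A, B].
    """
    label_logits = logits[:, [TOKEN_A_ID, TOKEN_B_ID]]  # (batch, 2)
    return F.softmax(label_logits, dim=-1)

def pointwise_loss(predicted_pref_prob: torch.Tensor,
                   soft_labels: torch.Tensor) -> torch.Tensor:
    """One plausible pointwise objective: binary cross-entropy between the
    reward model's predicted probability that output A is preferred and
    the soft label p(A) extracted from the judge."""
    target = soft_labels[:, 0]  # p("A" is preferred)
    return F.binary_cross_entropy(predicted_pref_prob, target)

# Toy usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    vocab_size = 100
    lm_logits = torch.randn(4, vocab_size)   # judge LM logits per example
    soft = label_probabilities(lm_logits)    # soft labels over {A, B}
    pred = torch.sigmoid(torch.randn(4))     # reward model's predictions
    print("soft labels:", soft)
    print("loss:", pointwise_loss(pred, soft).item())
```

Because the targets are probabilities rather than hard 0/1 labels, confidently judged pairs pull the reward model harder than ambiguous ones, which is the practical benefit of this supervision signal.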
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Preference Data Sample for Reward Model Training
A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?
Evaluating a Reward Model Training Strategy
You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.