Learn Before
Probability-Based Supervision Signals for Reward Models
Beyond discrete preference labels, the probabilities the language model assigns to each label can serve as pointwise supervision signals for training a reward model. To obtain them, the logits (or probabilities) of the specific label tokens (such as "A" and "B") are extracted from the language model's output, and these values are renormalized into a proper probability distribution over the labels, for example by applying a softmax over just the label tokens. The resulting soft probabilities capture how confident the judging model is in its preference, not merely which output it prefers.
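A minimal sketch of this idea in PyTorch is shown below. The token ids, the helper names (`label_probabilities`, `pointwise_loss`), and the choice of binary cross-entropy as the pointwise objective are all illustrative assumptions, not a reference implementation from the course.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary ids for the label tokens "A" and "B" (hypothetical).
TOKEN_A_ID = 32
TOKEN_B_ID = 33

def label_probabilities(logits: torch.Tensor) -> torch.Tensor:
    """Extract the logits of the label tokens "A" and "B" at the position
    where the judging model emits its label, then renormalize them into a
    two-way probability distribution with softmax.

    logits: shape (batch, vocab_size), the judge LM's output logits.
    Returns: shape (batch, 2), probabilities for [A, B].
    """
    label_logits = logits[:, [TOKEN_A_ID, TOKEN_B_ID]]  # (batch, 2)
    return F.softmax(label_logits, dim=-1)

def pointwise_loss(predicted_pref_prob: torch.Tensor,
                   soft_labels: torch.Tensor) -> torch.Tensor:
    """One plausible pointwise objective: binary cross-entropy between the
    reward model's predicted probability that output A is preferred and
    the soft label p(A) extracted from the judge."""
    target = soft_labels[:, 0]  # p("A" is preferred)
    return F.binary_cross_entropy(predicted_pref_prob, target)

# Toy usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    vocab_size = 100
    lm_logits = torch.randn(4, vocab_size)   # judge LM logits per example
    soft = label_probabilities(lm_logits)    # soft labels over {A, B}
    pred = torch.sigmoid(torch.randn(4))     # reward model's predictions
    print("soft labels:", soft)
    print("loss:", pointwise_loss(pred, soft).item())
```

Because the targets are probabilities rather than hard 0/1 labels, confidently judged pairs pull the reward model harder than ambiguous ones, which is the practical benefit of this supervision signal.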
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Preference Data Sample for Reward Model Training
A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?
Evaluating a Reward Model Training Strategy
You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.