Concept

Probability-Based Supervision Signals for Reward Models

In addition to discrete preference labels, the probabilities that the language model assigns to each label can be used as pointwise supervision signals for training a reward model. Concretely, the logits (or probabilities) of the specific label tokens, such as "A" and "B", are read from the judge model's next-token distribution. These values are then re-normalized into a proper probability distribution over the labels, for example by applying a softmax over the two label-token logits, which is equivalent to dividing each label's probability by the sum of the label probabilities.
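The sketch below illustrates one way to extract such soft labels, assuming a Hugging Face causal LM serves as the judge and the prompt is formatted so that the very next token is the label. The model name, prompt wording, and the Bradley-Terry-style soft loss are illustrative assumptions, not a recipe prescribed by this card.

```python
# A minimal sketch of probability-based supervision signals.
# Assumptions (not from the card): a Hugging Face causal LM acts as the
# judge, "A"/"B" are single tokens in its vocabulary (in BPE vocabularies
# the labels may tokenize with a leading space), and the prompt ends so
# that the very next token is the label.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "gpt2"  # placeholder judge model (assumption)
tokenizer = AutoTokenizer.from_pretrained(judge_name)
judge = AutoModelForCausalLM.from_pretrained(judge_name)
judge.eval()

def soft_preference_labels(prompt: str) -> torch.Tensor:
    """Renormalized probabilities over the label tokens "A" and "B"."""
    label_ids = [tokenizer.convert_tokens_to_ids(t) for t in ("A", "B")]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = judge(**inputs).logits          # (1, seq_len, vocab)
    next_token_logits = logits[0, -1]            # next-token distribution
    label_logits = next_token_logits[label_ids]  # keep only "A" and "B"
    # Softmax over the two label logits == dividing each label's
    # probability by the sum of the two label probabilities.
    return torch.softmax(label_logits, dim=-1)

def soft_preference_loss(reward_a: torch.Tensor,
                         reward_b: torch.Tensor,
                         p_a: torch.Tensor) -> torch.Tensor:
    """Pointwise supervision of a reward model with the soft label p_a.

    Bradley-Terry-style: the reward gap acts as a logit and is pulled
    toward the judge's probability that response A is preferred.
    """
    return F.binary_cross_entropy_with_logits(reward_a - reward_b, p_a)

probs = soft_preference_labels(
    "Response A: ...\nResponse B: ...\nWhich response is better? Answer:"
)
p_a = probs[0]  # soft target used in place of a hard 0/1 preference label
```

Training on such soft targets preserves how confident the judge was, which hard labels discard: an almost-tied comparison contributes a weaker gradient to the reward model than a clear-cut one.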

Updated 2026-05-03

Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences