Activity (Process)

Converting Listwise Rankings to Pairwise Preferences for Reward Model Training

To train a reward model in RLHF, preference data collected as a full ranking (listwise) must often be converted into a pairwise format. For instance, a single ranked list such as y1 ≻ y4 ≻ y2 ≻ y3 can be decomposed into every pairwise comparison it implies: (y1, y4), (y1, y2), (y1, y3), (y4, y2), (y4, y3), and (y2, y3), where the first element of each pair is preferred over the second. In general, a ranking of K responses yields K(K-1)/2 such pairs. This process generates a dataset of (prompt, preferred_response, rejected_response) tuples, which is the standard input format for training the reward model with a pairwise ranking objective.
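As a minimal sketch of this expansion (ranking_to_pairs is a hypothetical helper name, not something defined in the text), the Python snippet below turns one best-to-worst ranked list into (prompt, preferred_response, rejected_response) tuples:

    from itertools import combinations

    def ranking_to_pairs(prompt, ranked_responses):
        # Expand a listwise ranking into pairwise preference tuples.
        # ranked_responses is ordered best-to-worst, so every earlier
        # response is preferred over every later one; a ranking of K
        # responses therefore yields K * (K - 1) / 2 pairs.
        return [
            (prompt, preferred, rejected)
            for preferred, rejected in combinations(ranked_responses, 2)
        ]

    # The ranking y1 ≻ y4 ≻ y2 ≻ y3 expands to 6 tuples.
    for row in ranking_to_pairs("prompt text", ["y1", "y4", "y2", "y3"]):
        print(row)

Because itertools.combinations preserves the input order, each emitted pair automatically places the higher-ranked response first, so no extra sorting step is needed.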
