Preference Data Sample for Reward Model Training
A single data point for training a reward model via pairwise comparison is a tuple sampled from the preference dataset D. This tuple is represented as (x, y_w, y_l), where x is the input prompt, and y_w and y_l are two distinct responses generated for that prompt. Within this tuple, one response is designated as preferred over the other based on human feedback (e.g., y_w is preferred over y_l). This structure forms the basis for calculating the ranking loss.
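As a minimal sketch of how such a tuple feeds into training, the common Bradley-Terry-style ranking loss scores both responses with the reward model and penalizes the model when the preferred response does not receive a higher score: loss = -log sigmoid(r(x, y_w) - r(x, y_l)). The function below assumes scalar reward scores have already been computed; the function name and inputs are illustrative, not from the original text.

```python
import math

def pairwise_ranking_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_w - r_l)).

    r_preferred: reward score for the human-preferred response y_w.
    r_rejected:  reward score for the non-preferred response y_l.
    The loss shrinks as the margin (r_preferred - r_rejected) grows,
    pushing the reward model to rank y_w above y_l.
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the model already ranks the preferred response higher,
# the loss is small; when it ranks it lower, the loss is large.
loss_correct = pairwise_ranking_loss(2.0, 0.5)   # small loss
loss_wrong = pairwise_ranking_loss(0.5, 2.0)     # large loss
```

In practice the scores come from a reward model evaluated on (x, y_w) and (x, y_l), and this loss is averaged over a batch of preference tuples.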

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Preference Data Sample for Reward Model Training
A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?
Evaluating a Reward Model Training Strategy
You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.