Learn Before
Preference Data Sample for Reward Model Training
A single data point for training a reward model via pairwise comparison is a tuple sampled from the preference dataset D. This tuple is represented as (x, y_k1, y_k2), where x is the input prompt, and y_k1 and y_k2 are two distinct responses generated for that prompt. Within this tuple, one response is designated as preferred over the other based on human feedback (e.g., y_k1 is preferred over y_k2). This structure forms the basis for calculating the ranking loss.
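The role of such a tuple in training can be sketched in plain Python. The sketch below assumes the standard Bradley-Terry formulation of the pairwise ranking loss, -log(sigmoid(r(x, y_k1) - r(x, y_k2))), and uses made-up scalar reward scores in place of a real reward model:

```python
import math

def pairwise_ranking_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred response outranks the
    rejected one: -log(sigmoid(r_preferred - r_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# One preference data sample: (x, y_k1, y_k2). The prompt and responses
# here are illustrative, not from any real dataset.
sample = (
    "Explain the water cycle.",                  # x: input prompt
    "Water evaporates, condenses into clouds, "
    "and returns as precipitation.",             # y_k1: preferred response
    "It rains sometimes.",                       # y_k2: rejected response
)

# Hypothetical scores a reward model might assign to the two responses.
loss_agree = pairwise_ranking_loss(2.0, -1.0)   # model ranks y_k1 higher
loss_disagree = pairwise_ranking_loss(-1.0, 2.0)  # model ranks y_k2 higher
print(loss_agree < loss_disagree)  # agreeing with the label gives lower loss
```

Minimizing this loss pushes the model to assign a higher score to the preferred response than to the rejected one for the same prompt.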

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Preference Data Sample for Reward Model Training
A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?
Evaluating a Reward Model Training Strategy
You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.
Probability-Based Supervision Signals for Reward Models
Learn After
Pair-wise Ranking Loss Formula for RLHF Reward Model
A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful and detailed responses over unhelpful or overly brief ones. For the input prompt
x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?
Constructing a Preference Data Sample from Human Feedback
A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.
Prompt: 'Summarize the plot of Hamlet in three sentences.' Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.' Response B: 'Hamlet is a famous play.'
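Turning an evaluator's choice into a training tuple is a simple mapping. The helper below is a hypothetical sketch (the function name and 'A'/'B' convention are assumptions, not part of any standard API), applied to the Hamlet example above:

```python
def make_preference_sample(prompt: str, response_a: str, response_b: str,
                           chosen: str) -> tuple[str, str, str]:
    """Build a (x, y_k1, y_k2) tuple from a human choice, where y_k1 is
    the preferred response and y_k2 the rejected one."""
    if chosen == "A":
        return (prompt, response_a, response_b)
    return (prompt, response_b, response_a)

# The evaluator chose Response A, so it becomes y_k1 (preferred).
x, y_k1, y_k2 = make_preference_sample(
    "Summarize the plot of Hamlet in three sentences.",
    "Hamlet is a play about a prince who seeks revenge ...",  # Response A
    "Hamlet is a famous play.",                               # Response B
    chosen="A",
)
print(y_k2)  # the rejected response: 'Hamlet is a famous play.'
```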
Preference Dataset Sampling Operation