Learn Before
Preference Dataset Sampling Operation
When computing the loss for a reward model, $D$ represents a set of tuples $(x, y_{k_1}, y_{k_2})$, each containing an input and a pair of outputs. The expression $(x, y_{k_1}, y_{k_2}) \sim D$ designates a sampling operation that draws a specific tuple from $D$ according to a given probability. As an example of this sampling, a model input $x$ could first be drawn using a uniform distribution, followed by drawing a pair of outputs $(y_{k_1}, y_{k_2})$ based on the conditional probability that $y_{k_1}$ is preferred over $y_{k_2}$ given $x$. This probability is denoted mathematically as $\Pr(y_{k_1} \succ y_{k_2} \mid x)$.
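To make the two-stage sampling concrete, here is a minimal Python sketch. Only the abstract operation $(x, y_{k_1}, y_{k_2}) \sim D$ comes from the text above; the dataset layout, the annotator counts, and every name in the code (preference_data, sample_preference_tuple) are illustrative assumptions, and the loss itself is covered by the related card on the pair-wise ranking loss formula.

```python
import random

# Hypothetical preference dataset: each input x maps to ordered output
# pairs (y_k1, y_k2) together with counts of how often annotators
# preferred y_k1 over y_k2 for that x.
preference_data = {
    "Explain the water cycle.": {
        ("detailed answer", "one-line answer"): 9,   # y_k1 preferred 9 times
        ("one-line answer", "detailed answer"): 1,   # y_k1 preferred once
    },
    "Summarize the plot of Hamlet in three sentences.": {
        ("three-sentence summary", "Hamlet is a famous play."): 8,
        ("Hamlet is a famous play.", "three-sentence summary"): 2,
    },
}

def sample_preference_tuple(data):
    """Draw (x, y_k1, y_k2) ~ D in two stages.

    Stage 1: draw an input x uniformly.
    Stage 2: draw an ordered pair (y_k1, y_k2) with probability
             proportional to Pr(y_k1 > y_k2 | x), estimated here
             from the annotator counts.
    """
    x = random.choice(list(data))            # uniform over inputs
    pairs = list(data[x])
    counts = [data[x][p] for p in pairs]
    y_k1, y_k2 = random.choices(pairs, weights=counts, k=1)[0]
    return x, y_k1, y_k2

x, y_k1, y_k2 = sample_preference_tuple(preference_data)
print(f"x = {x!r}\npreferred y_k1 = {y_k1!r}\nrejected y_k2 = {y_k2!r}")
```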
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Pair-wise Ranking Loss Formula for RLHF Reward Model
A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful, detailed responses than to unhelpful or overly brief ones. For the input prompt x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?
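As a hedged illustration of what a correctly structured sample might look like for this prompt (both response strings below are invented placeholders, not the question's actual answer options):

```python
# Hypothetical (prompt, chosen_response, rejected_response) training point.
sample = (
    "Explain the water cycle.",                                # prompt (x)
    "Water evaporates from oceans and lakes, condenses into clouds, "
    "falls as precipitation, and collects in rivers and oceans, "
    "repeating the cycle.",                                    # chosen: helpful, detailed
    "It's when water moves around.",                           # rejected: unhelpful, brief
)
```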
Constructing a Preference Data Sample from Human Feedback
A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.
Prompt: 'Summarize the plot of Hamlet in three sentences.' Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.' Response B: 'Hamlet is a famous play.'
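Mapped onto the tuple structure from the main card, this interaction yields the following data point; this is a straightforward reading of the example above, with the variable names taken from the card's notation.

```python
# The evaluator's choice maps onto the (x, y_k1, y_k2) tuple as follows.
x = "Summarize the plot of Hamlet in three sentences."   # input prompt
y_k1 = ("Hamlet is a play about a prince who seeks revenge for his "
        "father's murder. He feigns madness, confronts his mother, and "
        "duels his uncle's co-conspirator, leading to a tragic end for "
        "the royal family.")                              # preferred (Response A)
y_k2 = "Hamlet is a famous play."                         # rejected (Response B)
data_point = (x, y_k1, y_k2)
```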