Constructing a Preference Data Sample from Human Feedback
A human labeler is tasked with creating a preference data sample. They are given a prompt and two generated responses. After reviewing them, the labeler chooses the better response and provides a rationale for that choice. Based on the information below, construct the correctly formatted data tuple (x, y_preferred, y_rejected) that would be used to train a reward model.
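For concreteness, here is a minimal Python sketch of such a data point. The placeholder strings and variable names are illustrative assumptions, not part of the exercise itself; only the field order (x, y_preferred, y_rejected) comes from the card.

# One preference data sample for reward-model training.
# Field order follows the exercise: (x, y_preferred, y_rejected).
x = "A prompt shown to the labeler."                      # input prompt
y_preferred = "The response the labeler judged better."   # chosen response
y_rejected = "The response the labeler judged worse."     # rejected response
sample = (x, y_preferred, y_rejected)

# A preference dataset is simply a collection of such tuples.
dataset = [sample]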
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Pair-wise Ranking Loss Formula for RLHF Reward Model
A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful, detailed responses than to unhelpful or overly brief ones. For the input prompt x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?
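The pairwise ranking loss named in this card's title is, in its standard InstructGPT-style form (notation adapted here to this page's (x, y_preferred, y_rejected) tuples):

\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_{\text{preferred}},\, y_{\text{rejected}}) \sim D}\left[\log \sigma\!\left(r_\theta(x, y_{\text{preferred}}) - r_\theta(x, y_{\text{rejected}})\right)\right]

where r_\theta is the reward model's scalar score, \sigma is the logistic sigmoid, and D is the preference dataset. Minimizing this loss pushes the model to score the chosen response above the rejected one.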
Constructing a Preference Data Sample from Human Feedback
A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.
Prompt: 'Summarize the plot of Hamlet in three sentences.'
Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.'
Response B: 'Hamlet is a famous play.'
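Mapped into code, this data point would look like the following sketch. The variable names x, y_k1, and y_k2 come from the card; the tuple wrapper is an assumption about how the sample is stored.

# Roles from the evaluator's judgment: Response A was chosen over Response B.
x = "Summarize the plot of Hamlet in three sentences."        # input prompt
y_k1 = ("Hamlet is a play about a prince who seeks revenge for his "
        "father's murder. He feigns madness, confronts his mother, and "
        "duels his uncle's co-conspirator, leading to a tragic end for "
        "the royal family.")                                  # preferred (Response A)
y_k2 = "Hamlet is a famous play."                             # rejected (Response B)
sample = (x, y_k1, y_k2)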
Preference Dataset Sampling Operation