Learn Before
  • Training a Reward Model with Preference Data

Preference Data Sample for Reward Model Training

A single data point for training a reward model via pairwise comparison is a tuple sampled from the preference dataset $\mathcal{D}_r$. This tuple is represented as $(\mathbf{x}, \mathbf{y}_{k_1}, \mathbf{y}_{k_2})$, where $\mathbf{x}$ is the input prompt, and $\mathbf{y}_{k_1}$ and $\mathbf{y}_{k_2}$ are two distinct responses generated for that prompt. Within this tuple, one response is designated as preferred over the other based on human feedback (e.g., $\mathbf{y}_{k_1}$ is preferred over $\mathbf{y}_{k_2}$). This structure forms the basis for calculating the ranking loss.
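A minimal sketch of how such a tuple might be represented and used, assuming a scalar reward model scored each response; the field names, helper function, and example scores below are illustrative, not part of the original text. The loss shown is the standard Bradley-Terry-style pairwise ranking objective, $-\log \sigma\big(r(\mathbf{x}, \mathbf{y}_{k_1}) - r(\mathbf{x}, \mathbf{y}_{k_2})\big)$:

```python
import math

# One preference data point (x, y_k1, y_k2): a prompt plus a
# human-preferred ("chosen") and a rejected response.
# Keys are illustrative; datasets often use chosen/rejected.
sample = {
    "prompt": "Summarize the plot of Hamlet in three sentences.",
    "chosen": "Hamlet is a play about a prince who seeks revenge ...",
    "rejected": "Hamlet is a famous play.",
}

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_k1) - r(x, y_k2)).

    The loss is minimized when the reward model assigns the preferred
    response a higher score than the rejected one.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen response's score pulls ahead.
print(pairwise_ranking_loss(2.0, 0.5))   # small loss: model ranks pair correctly
print(pairwise_ranking_loss(0.5, 2.0))   # large loss: model ranks pair incorrectly
```

In practice the two scores would come from one reward network evaluated on `(prompt, chosen)` and `(prompt, rejected)`, and the loss would be averaged over a batch of such tuples.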


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Preference Data Sample for Reward Model Training

  • A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?

  • Evaluating a Reward Model Training Strategy

  • You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.

Learn After
  • Pair-wise Ranking Loss Formula for RLHF Reward Model

  • A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful and detailed responses over unhelpful or overly brief ones. For the input prompt x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?

  • Constructing a Preference Data Sample from Human Feedback

  • A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.

    Prompt: 'Summarize the plot of Hamlet in three sentences.'
    Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.'
    Response B: 'Hamlet is a famous play.'