1Cademy

Learn Before

Preference Data Sample for Reward Model Training

Matching

A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.

Prompt: 'Summarize the plot of Hamlet in three sentences.' Response A: 'Hamlet is a play about a prin
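
The tuple described above can be sketched as a small data structure. This is a minimal illustration, not a prescribed format: the names `PreferenceSample` and `make_preference_sample` are hypothetical. The response texts are left as placeholders because the page truncates them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceSample:
    """One reward-model training point: (x, y_k1, y_k2)."""
    x: str     # input prompt
    y_k1: str  # preferred (chosen) response
    y_k2: str  # rejected response


def make_preference_sample(prompt: str, response_a: str,
                           response_b: str, chosen: str) -> PreferenceSample:
    """Order the two responses according to the evaluator's choice ("A" or "B")."""
    if chosen == "A":
        return PreferenceSample(x=prompt, y_k1=response_a, y_k2=response_b)
    if chosen == "B":
        return PreferenceSample(x=prompt, y_k1=response_b, y_k2=response_a)
    raise ValueError("chosen must be 'A' or 'B'")


# Placeholder strings stand in for the truncated response texts.
sample = make_preference_sample(
    prompt="Summarize the plot of Hamlet in three sentences.",
    response_a="<text of Response A>",
    response_b="<text of Response B>",
    chosen="A",
)
print(sample)
```

Because the evaluator chose Response A, it fills the `y_k1` slot and Response B fills `y_k2`. The prompt is stored unchanged as `x`.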

Updated 2025-10-08
