Formula

Preference Data Sample for Reward Model Training

A single data point for training a reward model via pairwise comparison is a tuple sampled from the preference dataset $\mathcal{D}_r$. This tuple is written $(\mathbf{x}, \mathbf{y}_{k_1}, \mathbf{y}_{k_2})$, where $\mathbf{x}$ is the input prompt, and $\mathbf{y}_{k_1}$ and $\mathbf{y}_{k_2}$ are two distinct responses generated for that prompt. Within this tuple, one response is designated as preferred over the other based on human feedback (e.g., $\mathbf{y}_{k_1}$ is preferred over $\mathbf{y}_{k_2}$). This structure forms the basis for calculating the ranking loss.
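As a minimal sketch, the tuple above can be represented as a simple record, and the ranking loss it feeds into can be illustrated with a Bradley-Terry style pairwise objective, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, a common choice for reward-model training. The field names (`prompt`, `chosen`, `rejected`) and the helper function below are hypothetical illustrations, not notation from the source.

```python
import math

# A hypothetical preference sample (x, y_k1, y_k2): one prompt plus two
# responses, with "chosen" marking the human-preferred response y_k1.
sample = {
    "prompt": "Explain gradient descent in one sentence.",
    "chosen": "Gradient descent iteratively updates parameters against the gradient.",
    "rejected": "It is a kind of math thing.",
}

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen and r_rejected are scalar reward-model scores assigned to the
    preferred and dispreferred responses for the same prompt."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward gap in favor of the chosen response grows,
# pushing the reward model to score preferred responses higher.
loss_close = pairwise_ranking_loss(0.1, 0.0)  # nearly indistinguishable scores
loss_clear = pairwise_ranking_loss(2.0, 0.0)  # clear preference
assert loss_clear < loss_close
```

With equal scores the loss is $\log 2$, the chance-level baseline; training drives it below that by widening the score gap on each sampled tuple.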

Updated 2026-04-20

Tags: Ch.2 Generative Models - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences