Learn Before
Preference Dataset Sampling Operation
When computing the loss for a reward model, $D$ represents a set of tuples $(x, y_{k_1}, y_{k_2})$, each containing an input and a pair of outputs. The expression $(x, y_{k_1}, y_{k_2}) \sim D$ designates a sampling operation that draws a specific tuple from $D$ according to a given probability. As an example of this sampling, a model input $x$ could first be drawn using a uniform distribution, followed by drawing a pair of outputs $(y_{k_1}, y_{k_2})$ based on the conditional probability that $y_{k_1}$ is preferred over $y_{k_2}$ given $x$. This probability is denoted mathematically as $\Pr(y_{k_1} \succ y_{k_2} \mid x)$.
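To make the two-stage sampling concrete, here is a minimal Python sketch. Only the abstract operation $(x, y_{k_1}, y_{k_2}) \sim D$ comes from the text above; the dataset layout, the annotator counts, and every name in the code (preference_data, sample_preference_tuple) are illustrative assumptions, and the loss itself is covered by the related card on the pair-wise ranking loss formula.

```python
import random

# Hypothetical preference dataset: each input x maps to ordered output
# pairs (y_k1, y_k2) together with counts of how often annotators
# preferred y_k1 over y_k2 for that x.
preference_data = {
    "Explain the water cycle.": {
        ("detailed answer", "one-line answer"): 9,   # y_k1 preferred 9 times
        ("one-line answer", "detailed answer"): 1,   # y_k1 preferred once
    },
    "Summarize the plot of Hamlet in three sentences.": {
        ("three-sentence summary", "Hamlet is a famous play."): 8,
        ("Hamlet is a famous play.", "three-sentence summary"): 2,
    },
}

def sample_preference_tuple(data):
    """Draw (x, y_k1, y_k2) ~ D in two stages.

    Stage 1: draw an input x uniformly.
    Stage 2: draw an ordered pair (y_k1, y_k2) with probability
             proportional to Pr(y_k1 > y_k2 | x), estimated here
             from the annotator counts.
    """
    x = random.choice(list(data))            # uniform over inputs
    pairs = list(data[x])
    counts = [data[x][p] for p in pairs]
    y_k1, y_k2 = random.choices(pairs, weights=counts, k=1)[0]
    return x, y_k1, y_k2

x, y_k1, y_k2 = sample_preference_tuple(preference_data)
print(f"x = {x!r}\npreferred y_k1 = {y_k1!r}\nrejected y_k2 = {y_k2!r}")
```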
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Pair-wise Ranking Loss Formula for RLHF Reward Model
A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful, detailed responses than to unhelpful or overly brief ones. For the input prompt x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?
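As a hedged illustration of what a correctly structured sample might look like for this prompt (both response strings below are invented placeholders, not the question's actual answer options):

```python
# Hypothetical (prompt, chosen_response, rejected_response) training point.
sample = (
    "Explain the water cycle.",                                # prompt (x)
    "Water evaporates from oceans and lakes, condenses into clouds, "
    "falls as precipitation, and collects in rivers and oceans, "
    "repeating the cycle.",                                    # chosen: helpful, detailed
    "It's when water moves around.",                           # rejected: unhelpful, brief
)
```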
Constructing a Preference Data Sample from Human Feedback
A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.
Prompt: 'Summarize the plot of Hamlet in three sentences.' Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.' Response B: 'Hamlet is a famous play.'
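Mapped onto the tuple structure from the main card, this interaction yields the following data point; this is a straightforward reading of the example above, with the variable names taken from the card's notation.

```python
# The evaluator's choice maps onto the (x, y_k1, y_k2) tuple as follows.
x = "Summarize the plot of Hamlet in three sentences."   # input prompt
y_k1 = ("Hamlet is a play about a prince who seeks revenge for his "
        "father's murder. He feigns madness, confronts his mother, and "
        "duels his uncle's co-conspirator, leading to a tragic end for "
        "the royal family.")                              # preferred (Response A)
y_k2 = "Hamlet is a famous play."                         # rejected (Response B)
data_point = (x, y_k1, y_k2)
```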