Learn Before
  • Training a Reward Model with Preference Data

Preference Data Sample for Reward Model Training

A single data point for training a reward model via pairwise comparison is a tuple sampled from the preference dataset $\mathcal{D}_r$. This tuple is represented as $(\mathbf{x}, \mathbf{y}_{k_1}, \mathbf{y}_{k_2})$, where $\mathbf{x}$ is the input prompt, and $\mathbf{y}_{k_1}$ and $\mathbf{y}_{k_2}$ are two distinct responses generated for that prompt. Within this tuple, one response is designated as preferred over the other based on human feedback (e.g., $\mathbf{y}_{k_1}$ is preferred over $\mathbf{y}_{k_2}$). This structure forms the basis for calculating the ranking loss.
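A minimal sketch of how such a tuple might be represented and used, assuming a scalar reward model scored each response; the field names, helper function, and example scores below are illustrative, not part of the original text. The loss shown is the standard Bradley-Terry-style pairwise ranking objective, $-\log \sigma\big(r(\mathbf{x}, \mathbf{y}_{k_1}) - r(\mathbf{x}, \mathbf{y}_{k_2})\big)$:

```python
import math

# One preference data point (x, y_k1, y_k2): a prompt plus a
# human-preferred ("chosen") and a rejected response.
# Keys are illustrative; datasets often use chosen/rejected.
sample = {
    "prompt": "Summarize the plot of Hamlet in three sentences.",
    "chosen": "Hamlet is a play about a prince who seeks revenge ...",
    "rejected": "Hamlet is a famous play.",
}

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_k1) - r(x, y_k2)).

    The loss is minimized when the reward model assigns the preferred
    response a higher score than the rejected one.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen response's score pulls ahead.
print(pairwise_ranking_loss(2.0, 0.5))   # small loss: model ranks pair correctly
print(pairwise_ranking_loss(0.5, 2.0))   # large loss: model ranks pair incorrectly
```

In practice the two scores would come from one reward network evaluated on `(prompt, chosen)` and `(prompt, rejected)`, and the loss would be averaged over a batch of such tuples.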


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Preference Data Sample for Reward Model Training

  • A development team aims to create a model that can judge the quality of different text outputs. They have a dataset where for each input prompt, two different generated outputs have been compared by a human, with one labeled as 'preferred' and the other as 'not preferred'. How should they configure the training process for their quality-judging model to effectively learn from this comparative data?

  • Evaluating a Reward Model Training Strategy

  • You are training a model to predict which of two AI-generated summaries of a news article a human would find more helpful. Arrange the following steps into the correct sequence for a single training iteration of this model.

Learn After
  • Pair-wise Ranking Loss Formula for RLHF Reward Model

  • A team is creating a dataset to train a reward model. The model's objective is to learn to assign higher scores to helpful and detailed responses over unhelpful or overly brief ones. For the input prompt x = 'Explain the water cycle.', which of the following data samples, represented as a tuple (prompt, chosen_response, rejected_response), would be the most effective and correctly structured training point for this objective?

  • Constructing a Preference Data Sample from Human Feedback

  • A human evaluator is presented with the following prompt and two responses. The evaluator chooses Response A as the better one. This interaction is used to create a single data point for training a reward model, structured as a tuple containing an input prompt (x), a preferred response (y_k1), and a rejected response (y_k2). Match each item below to its correct role in this data sample.

    Prompt: 'Summarize the plot of Hamlet in three sentences.'
    Response A: 'Hamlet is a play about a prince who seeks revenge for his father's murder. He feigns madness, confronts his mother, and duels his uncle's co-conspirator, leading to a tragic end for the royal family.'
    Response B: 'Hamlet is a famous play.'