Short Answer

Impact of Data Distribution on Reward Model Training

A team is training a reward model on a dataset of 10,000 preference pairs. They notice that 2,000 of these pairs all use a single prompt, 'Write a story about a robot,' while the remaining 8,000 pairs are spread across 4,000 other unique prompts. Given the standard empirical loss used for this training:

$$\mathcal{L}_r(\phi) = -\frac{1}{|\mathcal{D}_r|} \sum_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\in\mathcal{D}_r} \log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$$

Analyze the most likely consequence of this data distribution on the trained reward model's behavior, and explain how the structure of the formula leads to this outcome.
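As a starting point for the analysis, a small back-of-the-envelope calculation (using only the counts given in the question; the variable names are illustrative) shows how much of the loss, and hence of the gradient, the single over-represented prompt receives under the uniform per-pair average in the formula above:

```python
# Counts from the question: 10,000 preference pairs in total,
# 2,000 of which share one prompt; the rest cover 4,000 unique prompts.
total_pairs = 10_000
robot_pairs = 2_000
other_prompts = 4_000

# Because L_r averages uniformly over pairs, each pair contributes
# 1/|D_r| to the loss, so a prompt's share of the gradient is
# proportional to its pair count, not to its count as a unique prompt.
robot_loss_share = robot_pairs / total_pairs      # 0.20
per_prompt_share = 1 / (1 + other_prompts)        # ~0.00025 if prompts were weighted equally

print(f"robot prompt's share of the loss: {robot_loss_share:.0%}")
print(f"its share under per-prompt weighting: {per_prompt_share:.4%}")
print(f"over-weighting factor: {robot_loss_share / per_prompt_share:.0f}x")
```

The arithmetic suggests where to look: the uniform sum over pairs gives the repeated prompt hundreds of times more influence than any other single prompt, which is the structural property the question asks you to connect to the model's behavior.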


Updated 2025-10-07


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science