Multiple Choice

A reward model is trained to learn human preferences by minimizing the following loss function, an expectation over a preference dataset $\mathcal{D}$:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b)\sim\mathcal{D}}\left[\log \text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})\right]$$

In this dataset, $\mathbf{y}_a$ denotes a response preferred over response $\mathbf{y}_b$ for a given input $\mathbf{x}$. What is the primary effect of successfully minimizing this loss function on the model's behavior?
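The question does not say how $\text{Pr}_{\phi}$ is parameterized; the standard choice in reward modeling is the Bradley-Terry model, $\text{Pr}_{\phi}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) = \sigma\big(r_{\phi}(\mathbf{x},\mathbf{y}_a) - r_{\phi}(\mathbf{x},\mathbf{y}_b)\big)$, where $r_{\phi}$ is a scalar reward. Below is a minimal PyTorch sketch of the resulting pairwise loss under that assumption; the function name `preference_loss` and the toy reward values are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: Pr(y_a > y_b | x) = sigmoid(r_a - r_b), so the
    # negative log-likelihood of the preferred response is
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage (illustrative values): scalar rewards for 3 preference pairs.
r_a = torch.tensor([1.2, 0.3, 2.0])  # r_phi(x, y_a) for preferred responses
r_b = torch.tensor([0.5, 0.9, 1.1])  # r_phi(x, y_b) for rejected responses
print(preference_loss(r_a, r_b))     # loss shrinks as the margin r_a - r_b grows
```

Minimizing this quantity pushes the reward margin between preferred and rejected responses upward, which is exactly what the loss in the question rewards.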




Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy