Multiple Choice

A language model is being fine-tuned using a dataset of prompts $x$, preferred responses $y_a$, and dispreferred responses $y_b$. The training objective is to minimize the following loss function:

$$\mathcal{L} = -\mathbb{E}_{(x, y_a, y_b)}\left[\log \text{Pr}(y_a \succ y_b \mid x)\right]$$

In this framework, the probability that response $y_a$ is preferred over $y_b$, denoted $\text{Pr}(y_a \succ y_b \mid x)$, is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy.
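For reference, one common way to realize this setup (a DPO-style parameterization; the policy $\pi_\theta$, the fixed reference policy $\pi_{\text{ref}}$, and the scaling coefficient $\beta$ are illustrative symbols assumed here, not given in the question) expresses the preference probability as a sigmoid of the difference in scaled log-likelihood ratios:

$$\text{Pr}(y_a \succ y_b \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\text{ref}}(y_a \mid x)} - \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\text{ref}}(y_b \mid x)}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Under this assumption, the loss above is simply the negative log of this sigmoid, averaged over the preference triples.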

Based on this formulation, what is the most significant advantage of this training approach?


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science