Formula

Direct Preference Optimization (DPO) Loss Function

The Direct Preference Optimization (DPO) method trains the target policy by minimizing a loss function defined directly on preference data. It computes the negative log-likelihood of preference probabilities directly from the target and reference policies, skipping the intermediate reward model entirely. Over a preference dataset $\mathcal{D}_r$, the objective is formulated as:

$$\mathcal{L}_{\mathrm{dpo}}(\theta) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_a,\mathbf{y}_b) \sim \mathcal{D}_r} \big[ \log \mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) \big]$$
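Here $\mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x})$ is the probability that response $\mathbf{y}_a$ is preferred over $\mathbf{y}_b$ given prompt $\mathbf{x}$. In the standard DPO derivation (Rafailov et al., 2023), this probability is expressed in terms of the target policy $\pi_{\theta}$, a frozen reference policy $\pi_{\mathrm{ref}}$, and a temperature hyperparameter $\beta$ (symbols introduced here for illustration) via the Bradley-Terry model:

$$\mathrm{Pr}_{\theta}(\mathbf{y}_a \succ \mathbf{y}_b \mid \mathbf{x}) = \sigma\!\left(\beta \log \frac{\pi_{\theta}(\mathbf{y}_a \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}_a \mid \mathbf{x})} - \beta \log \frac{\pi_{\theta}(\mathbf{y}_b \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}_b \mid \mathbf{x})}\right)$$

where $\sigma$ is the logistic function.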

By minimizing this loss, the policy parameters $\theta$ are optimized to align with human preferences.
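As a minimal illustration, the following PyTorch sketch computes this loss under the Bradley-Terry parameterization above, assuming the per-sequence log-probabilities under $\pi_{\theta}$ and $\pi_{\mathrm{ref}}$ have already been gathered; the function and argument names are hypothetical, not from the text.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_a: torch.Tensor,  # log pi_theta(y_a | x), shape (batch,)
             policy_logp_b: torch.Tensor,  # log pi_theta(y_b | x), shape (batch,)
             ref_logp_a: torch.Tensor,     # log pi_ref(y_a | x), shape (batch,)
             ref_logp_b: torch.Tensor,     # log pi_ref(y_b | x), shape (batch,)
             beta: float = 0.1) -> torch.Tensor:
    """Negative log-likelihood of preferring y_a over y_b (illustrative sketch)."""
    # Implicit reward margin: beta * (preferred log-ratio - dispreferred log-ratio)
    logits = beta * ((policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b))
    # -log sigma(logits) = -log Pr_theta(y_a > y_b | x); average over the batch
    return -F.logsigmoid(logits).mean()

# Toy usage: random values stand in for model log-probabilities.
if __name__ == "__main__":
    pa, pb, ra, rb = (torch.randn(4) for _ in range(4))
    print(dpo_loss(pa, pb, ra, rb).item())
```

In practice, this loss would be minimized with a standard gradient-based optimizer over batches drawn from $\mathcal{D}_r$, with $\pi_{\mathrm{ref}}$ held fixed throughout training.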
