Concept

Unnormalized Target Distribution in the DPO Objective

In the rearranged Direct Preference Optimization (DPO) objective, the fixed term that does not depend on the target policy parameters $\theta$, namely $\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\big(\frac{1}{\beta} r(\mathbf{x},\mathbf{y})\big)$, is interpreted as an unnormalized probability distribution over $\mathbf{y}$. This conceptual shift is introduced because the objective is easier to reason about when written as a KL divergence between two valid probability distributions. To convert this unnormalized function into a proper probability distribution, it must be divided by a normalization factor.
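Concretely, in the standard DPO derivation this normalization factor is the partition function

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\Big(\frac{1}{\beta} r(\mathbf{x},\mathbf{y})\Big),$$

which yields the normalized target distribution

$$\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\theta_{\mathrm{ref}}}(\mathbf{y}|\mathbf{x}) \exp\Big(\frac{1}{\beta} r(\mathbf{x},\mathbf{y})\Big).$$

Since $Z(\mathbf{x})$ depends only on the prompt $\mathbf{x}$ and the fixed reference policy, it is constant with respect to $\theta$: dividing by it changes nothing about which policy optimizes the objective, but it makes the divergence interpretation well defined.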

Updated 2026-05-03

Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences