Learn Before
Unnormalized Target Distribution in the DPO Objective
In the rearranged Direct Preference Optimization (DPO) objective function, the fixed term that does not depend on the target policy π_θ, specifically π_ref(y|x) * exp(r(x, y) / β), is interpreted as an unnormalized probability distribution over responses y. This conceptual shift is introduced because it is mathematically more intuitive to evaluate the objective function as a KL divergence between two valid probability distributions. To formally convert this unnormalized function into a normalized probability distribution, it must be divided by a normalization factor Z(x) = Σ_y π_ref(y|x) * exp(r(x, y) / β).
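To make the normalization concrete, here is a minimal Python sketch over a toy set of three candidate responses; all names and numbers (pi_ref, reward, beta) are hypothetical illustrations, not values from the card:

import math

beta = 1.0  # KL-regularization strength (hypothetical value)

# Toy reference-policy probabilities pi_ref(y|x) for three candidate responses
pi_ref = {"y1": 0.5, "y2": 0.3, "y3": 0.2}

# Toy reward-model scores r(x, y) for the same responses
reward = {"y1": 1.0, "y2": 2.0, "y3": 0.5}

# Fixed term of the rearranged objective: pi_ref(y|x) * exp(r(x, y) / beta).
# It is nonnegative but does not sum to 1, hence "unnormalized".
unnormalized = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}

# Normalization factor Z(x) = sum over y of pi_ref(y|x) * exp(r(x, y) / beta)
Z = sum(unnormalized.values())

# Dividing by Z(x) yields a valid probability distribution over responses
pi_star = {y: value / Z for y, value in unnormalized.items()}

print(pi_star)
print(sum(pi_star.values()))  # 1.0 (up to floating-point error)

Note that as beta grows, exp(r(x, y) / beta) approaches 1 for every response, so the normalized target distribution collapses toward pi_ref; this is exactly the large-beta behavior probed by the first related question below.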
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Analyzing the Regularization Parameter in Policy Optimization
In the context of the training objective for a language model policy, consider the following formula, where the goal is to find the optimal parameters by minimizing the expected value:

min E[-reward + β * (log_prob_policy - log_prob_reference)]

If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?

A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
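Both related questions above evaluate the same per-sample quantity, -reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x)). The Python sketch below, with entirely hypothetical numbers, computes its two components for the scenario in the second question (high reward, response very unlikely under the reference policy) and sweeps beta to show the trade-off probed by the first:

import math

def objective_terms(reward, p_policy, p_ref, beta):
    """Return (-reward, KL penalty) for one sampled response."""
    reward_term = -reward
    kl_term = beta * (math.log(p_policy) - math.log(p_ref))
    return reward_term, kl_term

# Hypothetical per-sample values: a creative response with a high reward
# that is very unlikely under the reference policy.
reward_xy = 8.0    # reward(x, y): high
p_policy = 0.20    # pi_theta(y|x)
p_ref = 1e-4       # pi_theta_ref(y|x): very low

for beta in (0.1, 1.0, 100.0):
    r_term, kl_term = objective_terms(reward_xy, p_policy, p_ref, beta)
    total = r_term + kl_term
    print(f"beta={beta:>6}: reward term={r_term:+.2f}, "
          f"KL penalty={kl_term:+.2f}, total={total:+.2f}")

For small beta, the large reward dominates and this sample lowers the objective. Because log π_θ(y|x) - log π_θ_ref(y|x) is large and positive here (about +7.6), the KL penalty grows linearly in beta, and for very large beta it overwhelms the reward term, pushing the optimized policy back toward the reference policy.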