Concept

Elimination of the Reward Model in DPO

A key advantage of Direct Preference Optimization (DPO) is that it eliminates the explicit reward model during training. This follows from re-expressing the Bradley-Terry preference probability in terms of policy ratios, a formulation in which the intractable partition function Z(x) cancels out. As a result, preference probabilities can be computed using only the target and reference policies, which streamlines the alignment process.
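The cancellation can be sketched in plain Python. This is a minimal illustration, not a full training loop: the function name and β value are illustrative, and the sequence log-probabilities are assumed to come from the target and reference models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_preference_prob(logp_w_policy, logp_w_ref,
                        logp_l_policy, logp_l_ref, beta=0.1):
    """Probability that the preferred ('winner') response beats the other.

    Each argument is a sequence log-probability log pi(y|x) under either
    the target policy or the frozen reference policy. DPO's implicit reward,
    r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) + beta * log Z(x),
    enters only through the difference r(x, y_w) - r(x, y_l), so the
    intractable log Z(x) term cancels and is never computed.
    """
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    return sigmoid(margin)

# If the target policy equals the reference, both log-ratios are zero
# and the predicted preference probability is exactly 0.5.
print(dpo_preference_prob(-5.0, -5.0, -7.0, -7.0, beta=1.0))  # 0.5
```

Note that only four log-probabilities are needed per preference pair; no separate reward network is ever queried.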


Updated 2026-01-15


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
