
Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is an alignment method that simplifies the training pipeline by removing the need to explicitly model rewards. Instead of fitting a separate reward model, which can be difficult to train reliably and can mislead policy learning when trained poorly, DPO optimizes the language model's policy directly on human preference data. As a result, it achieves preference alignment in a straightforward, supervised-learning-like fashion.
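
Concretely, DPO trains on preference triples (a prompt x, a chosen response y_w, and a rejected response y_l) by widening the beta-scaled log-probability margin of y_w over y_l relative to a frozen reference model. Below is a minimal PyTorch sketch of this loss under stated assumptions: the function name, argument names, and the beta value are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss computed from per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,): the summed token
    log-probability of the chosen/rejected response under the trainable
    policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: push the chosen response's
    # implicit reward above the rejected response's.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

Because the reference model stays frozen, the log-ratio terms play the role of implicit rewards, and minimizing this loss is an ordinary gradient-descent step on labeled pairs rather than a reinforcement-learning rollout.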



Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
