Learn Before
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is an alignment method that simplifies the training framework by eliminating the need to explicitly model rewards. Instead of training a separate reward model, which can be difficult to fit reliably and, if poorly trained, can harm policy learning, DPO optimizes the language model's policy directly on human preference data. In doing so, it achieves human preference alignment in a straightforward, supervised-learning-like fashion.
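For reference, the DPO training objective (from Rafailov et al., 2023) is a classification-style loss over preference pairs; the symbols below (prompt x, preferred response y_w, dispreferred response y_l, trained policy \pi_\theta, frozen reference policy \pi_{\mathrm{ref}}, sigmoid \sigma, scaling hyperparameter \beta) follow that paper rather than anything defined on this card:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Because this is just a differentiable loss over a static dataset of preference pairs, it can be minimized like any supervised objective. Below is a minimal PyTorch sketch, assuming the summed per-sequence log-probabilities have already been computed; the function and argument names are illustrative, not from any particular library.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios log(pi_theta / pi_ref) for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): pushes the preferred response's ratio above the rejected one's.
    losses = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    return losses.mean()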

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Objective Function for Policy Learning in RLHF
Use of Proximal Policy Optimization (PPO) in RLHF
Application of A2C in RLHF for LLM Alignment
Role and Definition of the Reference Model in RLHF
Joint Optimization of Policy and Value Functions in RLHF
RLHF Policy Optimization Objective
Reference Policy in RLHF
RLHF Policy Optimization as Loss Minimization
A language model is being fine-tuned using an iterative feedback process. In each step, the model generates a response to a prompt. A separate, pre-trained scoring model then assigns a numerical score to this response based on its quality. What is the most direct and immediate use of this numerical score within a single step of this training loop?
Arrange the following events into the correct chronological order as they would occur within a single iterative step of the policy learning phase for a language model.
Diagnosing a Training Failure in an Iterative Fine-Tuning Process
Learn After
Fixed Model Assumption in DPO Optimization
Comparison of DPO and PPO Sample Efficiency
DPO as an Offline Reinforcement Learning Method
Conceptual Reward Model in DPO's Training Objective
Reference Policy in DPO's Penalty Term
A research team is shifting their strategy for aligning a language model with human preferences. Their previous method involved two distinct stages: first, training a separate 'reward model' on a dataset of human judgments, and second, using this model to provide feedback signals to fine-tune the language model through online sampling. They are now adopting a new, more direct approach that uses a static dataset of preferred and dispreferred responses to optimize the language model's policy in a single stage. Based on this shift, what is the most fundamental change to their training pipeline?
A startup with a limited computational budget wants to align a language model with human preferences. They have a high-quality, but static, dataset of prompts, where each prompt is paired with a 'preferred' response and a 'rejected' response. A key constraint is that they cannot afford to repeatedly generate new samples from the model for evaluation during the training loop. Which of the following alignment strategies is the most practical and efficient for this startup to adopt?
Choosing an Alignment Strategy
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Your team is reviewing two proposed alignment impl...
In a preference-based LLM alignment project, your ...
Your team must choose an alignment approach for an...
Your team is implementing preference-based alignme...