Activity (Process)

Direct Preference Optimization (DPO) Training Process

Direct Preference Optimization (DPO) offers a more direct route to aligning models with human preferences than traditional RLHF. DPO uses preference data, where a preferred response $\mathbf{y}_a$ is paired with a rejected one $\mathbf{y}_b$, to update the policy directly. It does so by training the policy with a Maximum Likelihood Estimation (MLE) objective, which bypasses the intermediate RLHF steps of explicitly training a reward model and then running reinforcement learning against it.
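The MLE objective described above can be sketched as a per-example loss: the policy's log-probability margin between the preferred response $\mathbf{y}_a$ and the rejected response $\mathbf{y}_b$ (each measured relative to a frozen reference model) is pushed through a sigmoid and negated. The function below is a minimal illustration, not the chapter's reference implementation; the argument names and the scaling factor `beta` are assumptions for the sketch.

```python
import math

def dpo_loss(logp_theta_a, logp_theta_b, logp_ref_a, logp_ref_b, beta=0.1):
    """Per-example DPO loss (a sketch; argument names are illustrative).

    logp_theta_*: log-prob of the preferred (a) / rejected (b) response
                  under the policy being trained.
    logp_ref_*:   log-probs of the same responses under the frozen
                  reference policy.
    beta:         temperature scaling the implicit reward margin.
    """
    # Implicit rewards are log-ratios of policy vs. reference probability.
    margin = beta * ((logp_theta_a - logp_ref_a) - (logp_theta_b - logp_ref_b))
    # Negative log-sigmoid of the margin: minimized when the policy
    # assigns the preferred response a larger log-ratio than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the loss is log 2.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the preferred response's probability shrinks the loss.
improved = dpo_loss(-0.5, -1.5, -1.0, -1.0)
```

Minimizing this loss over a dataset of $(\mathbf{y}_a, \mathbf{y}_b)$ pairs is the MLE training step; no separate reward model or RL loop is needed.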


Updated 2025-10-10


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences