Learn Before
Direct Preference Optimization (DPO) Training Process
Direct Preference Optimization (DPO) offers a more direct route to aligning models with human preferences than traditional RLHF. The DPO process uses preference data, which marks a preferred response (y_w) over a rejected one (y_l) for a given prompt, to update the policy directly. It trains the policy with a Maximum Likelihood Estimation (MLE) objective, bypassing the intermediate steps of explicitly training a reward model and then running reinforcement learning against it.
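
As a concrete sketch (not part of the original card), the DPO objective can be written in a few lines of PyTorch. It is the negative log-sigmoid of a scaled margin between the policy-vs-reference log-ratios of the chosen and rejected responses: L = -E[ log sigmoid( beta * (log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)) ) ]. The tensor names and the beta value below are illustrative assumptions; a real trainer would compute the per-sequence log-probabilities from the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss. Each input is log pi(y | x) summed over response tokens."""
    # Implicit rewards: log-ratios of the policy against the frozen reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # MLE under a Bradley-Terry preference model: increase the margin
    # between the chosen (y_w) and rejected (y_l) responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Illustrative usage with a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
```

Note that the reference log-probabilities are constants during training; only the policy's parameters receive gradients, which is what lets DPO skip the separate reward model and RL loop.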

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Model as an Imperfect Environment Proxy
Direct Preference Optimization (DPO) Training Process
Comparison of RLHF and DPO Training Pipelines
Limitations of Human Feedback for LLM Alignment
An AI development team aims to align a large language model to be more helpful. For a given prompt, they collect two different responses from the model and have human annotators label which of the two is superior. What is the primary, most direct function of this type of dataset in a human-preference alignment methodology?
A development team is refining a large language model to be more helpful and harmless. They are using a method that involves learning from human judgments about which of two responses is better. Arrange the following three core stages of this alignment process into the correct chronological order.
Insufficiency of Data Fitting for Complex Value Alignment
Comparison of AI Feedback and Human Feedback for LLM Alignment
Outcome-Based Reward Models
AI Chatbot Alignment Strategy
Learn After
A team of AI developers is refining a language model using a dataset of human preferences. Each data point consists of a prompt, a 'chosen' response, and a 'rejected' response. Instead of first training a separate model to score how good a response is and then using that score to guide the language model, they directly adjust the main language model's parameters to increase the probability of generating 'chosen' responses over 'rejected' ones. What is a key advantage of this direct adjustment method?
AI Alignment Strategy Selection
Mechanism of Direct Policy Optimization