Case Study

Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios

You are reviewing an internal alignment experiment in which the team claims to have implemented Direct Preference Optimization (DPO) as a replacement for an RLHF-with-PPO pipeline. They trained a target policy π_θ on a fixed dataset of human preference pairs (x, y_chosen, y_rejected) against a fixed reference policy π_ref. For one prompt x, the training logs show the following values computed from model log-probabilities:

A = log( π_θ(y_chosen|x) / π_ref(y_chosen|x) ) = +0.20
B = log( π_θ(y_rejected|x) / π_ref(y_rejected|x) ) = +0.80
β = 2
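
For reference, in the standard DPO parameterization (a Bradley–Terry model over the policy log-ratios), these logged quantities combine into the model's implied preference probability as

p_θ(y_chosen ≻ y_rejected | x) = σ( β [ log(π_θ(y_chosen|x)/π_ref(y_chosen|x)) − log(π_θ(y_rejected|x)/π_ref(y_rejected|x)) ] ) = σ( β (A − B) ),

where σ is the logistic sigmoid; this restates the usual DPO form rather than a value taken from the team's logs.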

The team also proposes adding an online loop that periodically samples new responses from π_θ, scores them with a learned reward model, and appends them to the dataset “to make DPO work better.”

As the reviewer, analyze this situation:

1. Using the DPO preference-probability form based on policy ratios, determine whether the logged values imply that the model currently assigns a preference probability above or below 0.5 to y_chosen over y_rejected for this x, and briefly justify your answer using the sign and magnitude of the difference of log-ratios (there is no need to compute an exact sigmoid value).

2. Based on what makes DPO different from RLHF-with-PPO, evaluate whether the proposed online reward-model sampling loop is consistent with DPO's core training pipeline and with its characterization as offline RL, and explain the key tradeoff introduced by adopting that proposal.
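
For reference, a minimal per-pair sketch of the DPO objective described above (PyTorch-style; the function name dpo_loss and the convention of passing summed response log-probabilities are illustrative assumptions, not the team's actual implementation):

```python
import torch.nn.functional as F

def dpo_loss(logp_theta_chosen, logp_theta_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=2.0):
    """Per-pair DPO loss; inputs are summed response log-probabilities as scalar tensors."""
    # Log-ratios against the frozen reference policy; these correspond to
    # the logged quantities A (chosen) and B (rejected) above.
    ratio_chosen = logp_theta_chosen - logp_ref_chosen
    ratio_rejected = logp_theta_rejected - logp_ref_rejected
    # DPO minimizes -log sigmoid(beta * (A - B)), which pushes the implied
    # preference probability for the chosen response toward 1.
    return -F.logsigmoid(beta * (ratio_chosen - ratio_rejected))
```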

