Consider a single data point (x, y_a, y_b) from a preference dataset, where y_a is the preferred response and y_b is the dispreferred response. In a training framework that directly optimizes a policy π_θ against a fixed reference policy π_ref by maximizing the log-probability of the preference data, if the policy π_θ currently assigns an equal likelihood to both responses (i.e., π_θ(y_a|x) = π_θ(y_b|x)), the loss contribution from this data point will be zero.
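A worked check (a sketch, assuming the framework described is DPO with the standard Bradley-Terry objective and scaling coefficient β): the per-example loss is

-log σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) )

When π_θ(y_a|x) = π_θ(y_b|x), the policy terms cancel and the loss reduces to -log σ( β log(π_ref(y_b|x)/π_ref(y_a|x)) ). Since -log σ(z) > 0 for every finite z, the contribution is strictly positive; in the special case where π_ref also assigns equal likelihoods to both responses, it equals -log σ(0) = log 2 ≈ 0.693.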
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being fine-tuned using a dataset of prompts (x), preferred responses (y_a), and dispreferred responses (y_b). The training objective is to minimize the following loss function:

L(θ) = -E_{(x, y_a, y_b)} [ log σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) ) ]
In this framework, the probability that response y_a is preferred over y_b, denoted p(y_a ≻ y_b | x), is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy: p(y_a ≻ y_b | x) = σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) ).
Based on this formulation, what is the most significant advantage of this training approach?
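A minimal numerical sketch of the quantities in this card (assuming the loss above is the DPO objective; the function names, β value, and probabilities are illustrative, not from the original):

import math

def preference_probability(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Bradley-Terry probability that y_a is preferred over y_b:
    # sigmoid of the beta-scaled difference of log-ratios.
    margin = beta * (math.log(pi_theta_a / pi_ref_a)
                     - math.log(pi_theta_b / pi_ref_b))
    return 1.0 / (1.0 + math.exp(-margin))

def dpo_loss(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Per-example loss: negative log of the preference probability.
    return -math.log(preference_probability(
        pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta))

# Illustrative likelihoods: the policy favors y_a relative to the reference,
# so p(y_a > y_b) rises above 0.5 and the loss falls below log 2 ~= 0.693.
p = preference_probability(0.30, 0.10, 0.20, 0.20)
print(p, dpo_loss(0.30, 0.10, 0.20, 0.20))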
Analysis of Policy Alignment with Preference Data
Comparison of DPO and RLHF Loss Functions