Consider a single data point (x, y_a, y_b) from a preference dataset, where y_a is the preferred response and y_b is the dispreferred response. In a training framework that directly optimizes a policy π_θ against a fixed reference policy π_ref by maximizing the log-probability of the preference data, if the policy π_θ currently assigns an equal likelihood to both responses (i.e., π_θ(y_a|x) = π_θ(y_b|x)), the loss contribution from this data point will be zero.
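A worked check (a sketch, assuming the framework described is DPO with the standard Bradley-Terry objective and scaling coefficient β): the per-example loss is

-log σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) )

When π_θ(y_a|x) = π_θ(y_b|x), the policy terms cancel and the loss reduces to -log σ( β log(π_ref(y_b|x)/π_ref(y_a|x)) ). Since -log σ(z) > 0 for every finite z, the contribution is strictly positive; in the special case where π_ref also assigns equal likelihoods to both responses, it equals -log σ(0) = log 2 ≈ 0.693.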
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being fine-tuned using a dataset of prompts (x), preferred responses (y_a), and dispreferred responses (y_b). The training objective is to minimize the following loss function:

L(θ) = -E_{(x, y_a, y_b)} [ log σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) ) ]
In this framework, the probability that response y_a is preferred over y_b, denoted p(y_a ≻ y_b | x), is computed directly from the likelihoods of each response under the current policy being trained and a fixed reference policy: p(y_a ≻ y_b | x) = σ( β log(π_θ(y_a|x)/π_ref(y_a|x)) - β log(π_θ(y_b|x)/π_ref(y_b|x)) ).
Based on this formulation, what is the most significant advantage of this training approach?
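A minimal numerical sketch of the quantities in this card (assuming the loss above is the DPO objective; the function names, β value, and probabilities are illustrative, not from the original):

import math

def preference_probability(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Bradley-Terry probability that y_a is preferred over y_b:
    # sigmoid of the beta-scaled difference of log-ratios.
    margin = beta * (math.log(pi_theta_a / pi_ref_a)
                     - math.log(pi_theta_b / pi_ref_b))
    return 1.0 / (1.0 + math.exp(-margin))

def dpo_loss(pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta=0.1):
    # Per-example loss: negative log of the preference probability.
    return -math.log(preference_probability(
        pi_theta_a, pi_theta_b, pi_ref_a, pi_ref_b, beta))

# Illustrative likelihoods: the policy favors y_a relative to the reference,
# so p(y_a > y_b) rises above 0.5 and the loss falls below log 2 ~= 0.693.
p = preference_probability(0.30, 0.10, 0.20, 0.20)
print(p, dpo_loss(0.30, 0.10, 0.20, 0.20))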
Analysis of Policy Alignment with Preference Data
Comparison of DPO and RLHF Loss Functions