Learn Before
An alignment algorithm calculates the probability of a preferred response y_a over a dispreferred response y_b for a given prompt x using the following expression:
Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
Based on a direct analysis of this expression, which of the following components is not explicitly required to compute this probability during the training process?
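For concreteness, a minimal numerical sketch of this probability follows. The function name and the summed log-probability inputs are illustrative assumptions, not part of the original card; the code simply evaluates the sigmoid of the difference between the β-scaled policy/reference log-ratios of the two responses.

import math

def dpo_preference_probability(logp_theta_a, logp_ref_a,
                               logp_theta_b, logp_ref_b, beta=0.1):
    # Implicit per-response score: beta * log( pi_theta(y|x) / pi_ref(y|x) ),
    # computed from summed log-probabilities of each full response.
    score_a = beta * (logp_theta_a - logp_ref_a)
    score_b = beta * (logp_theta_b - logp_ref_b)
    # Pr(y_a preferred over y_b | x) = sigmoid(score_a - score_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Example: the trained policy slightly raises the preferred response's
# log-probability relative to the reference and lowers the dispreferred one.
print(dpo_preference_probability(-10.0, -10.5, -12.5, -12.0, beta=0.1))  # ≈ 0.525

Note that only the two policies' log-probabilities on the two responses appear; no quantity outside the expression is needed to evaluate it.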
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model alignment method re-expresses the probability of a preferred response (y_a) over a dispreferred response (y_b) for a given prompt (x) as follows:
Pr(y_a ≻ y_b | x) = Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
where π_θ is the policy being trained and π_ref is a fixed reference policy. Based on this mathematical formulation, what is the primary reason this method can be trained without an explicit, separately-trained reward model? (A short derivation sketch appears after the related items below.)
Mechanism of Reward Model Elimination
An alignment algorithm calculates the probability of a preferred response y_a over a dispreferred response y_b for a given prompt x using the following expression:
Sigmoid( β log( π_θ(y_a|x) / π_ref(y_a|x) ) - β log( π_θ(y_b|x) / π_ref(y_b|x) ) )
Based on a direct analysis of this expression, which of the following components is not explicitly required to compute this probability during the training process?
Your team must choose an alignment approach for an...
Your team is implementing preference-based alignme...
Your team is reviewing two proposed alignment impl...
In a preference-based LLM alignment project, your ...
Selecting and Justifying DPO vs. RLHF for Preference Alignment Under Operational Constraints
Explaining DPO’s Objective as Offline RL Without a Reward Model: A Pipeline and Math-Based Justification
Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications
Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints
Interpreting DPO Preference Probabilities and Pipeline Implications from Logged Policy Ratios
Choosing an Alignment Pipeline and Debugging a DPO Objective Under Compute and Data Constraints
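Derivation sketch referenced above (a reconstruction of the standard reasoning behind the formula, not text from the original cards): the quantity β log( π_θ(y|x) / π_ref(y|x) ) plays the role of a reward defined directly by the two policies, so substituting it into the Bradley–Terry preference model reproduces the expression, with the intractable normalizer cancelling in the difference; no separately trained reward model is needed.

\[
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
\]
\[
\Pr(y_a \succ y_b \mid x)
  \;=\; \sigma\!\bigl(r_\theta(x, y_a) - r_\theta(x, y_b)\bigr)
  \;=\; \sigma\!\Bigl(\beta \log \frac{\pi_\theta(y_a \mid x)}{\pi_{\mathrm{ref}}(y_a \mid x)}
                 \;-\; \beta \log \frac{\pi_\theta(y_b \mid x)}{\pi_{\mathrm{ref}}(y_b \mid x)}\Bigr)
\]

The β log Z(x) term is identical for both responses to the same prompt, so it cancels, leaving only policy and reference log-probabilities in the training objective.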