Example of Listwise Ranking in RLHF
A practical instance of listwise ranking in Reinforcement Learning from Human Feedback (RLHF) involves human experts ordering multiple model-generated outputs for a single prompt. For example, if a dataset sample contains a set of four generated outputs, denoted as {y1, y2, y3, y4}, an expert might order them from most preferred to least preferred. One possible ranking could be y3 ≻ y1 ≻ y4 ≻ y2, which indicates that y3 is the best response, followed sequentially by y1, y4, and finally y2.
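Such a full ordering can be modeled probabilistically, for instance with the Plackett-Luce model mentioned among the related cards: the top remaining output is picked with probability proportional to the exponential of its score, then the next, and so on. A minimal sketch, assuming hypothetical reward-model scores for the four outputs listed in their ranked order:

```python
import math

def plackett_luce_prob(scores):
    """Plackett-Luce probability of a ranking, given scores for the
    items listed best-to-worst. At each step, the top remaining item
    is drawn with probability proportional to exp(score)."""
    exps = [math.exp(s) for s in scores]
    prob = 1.0
    for i in range(len(exps)):
        prob *= exps[i] / sum(exps[i:])
    return prob

# Hypothetical scores for four outputs, already in preferred order.
print(plackett_luce_prob([2.0, 1.0, 0.5, -1.0]))
```

With equal scores every permutation is equally likely, so a four-item ranking gets probability 1/24; higher scores for the top-ranked items push the probability of the observed ranking toward 1.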
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Ch.4 Alignment - Foundations of Large Language Models
Related
Example of a Human Preference Ranking in RLHF
Listwise Loss from Accumulated Pairwise Comparisons
Plackett-Luce Model for Listwise Ranking
Example of Listwise Ranking in RLHF
A team is developing a language model to generate compelling short story endings. To gather human feedback, they generate four different endings for each story prompt. They are considering two feedback collection strategies:
Strategy 1: Human annotators are shown all four endings at once and asked to order them from best to worst.
Strategy 2: Human annotators are shown each of the four endings one at a time and asked to rate its quality on a scale of 1 to 10.
Based on the goal of collecting the most reliable data for model improvement, which strategy is generally more effective and why?
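One reason rankings tend to be more reliable than independent scores is that absolute scales differ across annotators, while the ordering they induce is often stable. A minimal sketch with two hypothetical annotators whose raw scores disagree but whose rankings agree:

```python
# Two hypothetical annotators score the same four endings on a 1-10
# scale. One is harsh, one is generous, so their raw scores differ,
# but the best-to-worst order each set of scores implies is identical.
annotator_a = {"ending1": 3, "ending2": 7, "ending3": 5, "ending4": 2}
annotator_b = {"ending1": 6, "ending2": 10, "ending3": 8, "ending4": 5}

def induced_ranking(scores):
    # Best-to-worst order implied by the scores.
    return sorted(scores, key=scores.get, reverse=True)

print(induced_ranking(annotator_a))  # ['ending2', 'ending3', 'ending1', 'ending4']
print(induced_ranking(annotator_b))  # ['ending2', 'ending3', 'ending1', 'ending4']
```

Averaging the raw scores would mix calibration noise into the signal, whereas asking directly for a ranking (Strategy 1) records only the comparative judgment.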
Improving Feedback Collection for a Chatbot
When using a listwise ranking approach to collect human feedback for a language model, the primary task for an annotator is to order a set of the model's generated outputs from most preferred to least preferred, rather than to assign an independent numerical quality score (e.g., 1 to 10) to each output.
Example of a Human Preference Ranking in RLHF
Ranked Preference Notation
Example of Listwise Ranking in RLHF
A language model generates two different summaries for a given article: Summary 1 and Summary 2. A human evaluator is tasked with reviewing them and determines that Summary 1 is more coherent and factually accurate than Summary 2. How would this specific judgment be formally expressed using standard preference notation?
A human annotator provides the following judgments for four text completions (C1, C2, C3, C4) generated in response to a single prompt: C1 ≻ C4, C4 ≻ C2, and C2 ≻ C3. Based on this information, arrange the completions in order from most preferred to least preferred.
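Judgments like these can be chained into a total order by treating each comparison as an edge in a preference graph and topologically sorting it. A minimal sketch using Python's standard `graphlib`, with the three pairwise judgments above:

```python
from graphlib import TopologicalSorter

# Pairwise judgments: each tuple (a, b) means a ≻ b.
judgments = [("C1", "C4"), ("C4", "C2"), ("C2", "C3")]

# graphlib expects a mapping from each node to its predecessors,
# i.e. the items that must come before it (the items that beat it).
graph = {}
for winner, loser in judgments:
    graph.setdefault(loser, set()).add(winner)

order = list(TopologicalSorter(graph).static_order())
print(" ≻ ".join(order))  # C1 ≻ C4 ≻ C2 ≻ C3
```

If the judgments contained a cycle (e.g. C3 ≻ C1 added to the above), `static_order` would raise a `CycleError`, which is one way inconsistent annotations can be detected.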
Limitations of Preference Notation
Learn After
A team is refining a language model. For a single user prompt, the model generates four distinct responses: Response 1, Response 2, Response 3, and Response 4. A human evaluator is tasked with ordering these responses from best to worst. The evaluator concludes that Response 3 is the most helpful. Response 1 is the second-best, followed by Response 4. Response 2 is deemed the least helpful. Using the notation where '≻' signifies 'is preferred over,' which option correctly represents the evaluator's complete ranking?
A human evaluator was asked to rank three different responses (Response A, Response B, Response C) generated by a language model for the same prompt. Match each formal preference notation with the correct description of the evaluator's ranking.
Interpreting Evaluator Preferences