Examples of LLM-Generated Responses for RLHF Evaluation
In the data-collection phase of Reinforcement Learning from Human Feedback (RLHF), an LLM generates multiple distinct outputs for a single prompt by sampling from its output distribution. For instance, given the prompt 'How can I live a more environmentally friendly life?', the model might produce the following set of four responses, denoted {y_1, y_2, y_3, y_4}, for human evaluation:
- Output 1 (y_1): Consider switching to an electric vehicle or bicycle instead of traditional cars to reduce carbon emissions and protect our planet.
- Output 2 (y_2): Adopt a minimalist lifestyle. Own fewer possessions to reduce consumption and the environmental impact of manufacturing and disposal.
- Output 3 (y_3): Go off-grid. Generate your own renewable energy and collect rainwater to become completely self-sufficient and reduce reliance on non-renewable resources.
- Output 4 (y_4): Support local farm products to reduce the carbon footprint of transporting food, while enjoying fresh, healthy food.
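The sampling step described above can be sketched in a few lines of code. The snippet below is a minimal, self-contained illustration, not tied to any real model API: it stands in for an LLM with a toy probability distribution over canned completions for one prompt, and draws a set of candidate outputs y_1..y_4 of the kind that human annotators would later rank. The candidate strings, their probabilities, and the `sample_responses` helper are all invented for illustration.

```python
import random

# Toy stand-in for an LLM: a distribution over canned completions
# for one prompt. In a real system these probabilities would come
# from autoregressive decoding over the model's vocabulary.
CANDIDATES = {
    "Switch to an electric vehicle or a bicycle.": 0.35,
    "Adopt a minimalist lifestyle and consume less.": 0.25,
    "Go off-grid with renewable energy and rainwater.": 0.15,
    "Buy local farm products to cut food miles.": 0.25,
}

def sample_responses(n, seed=0):
    """Sample n candidate outputs y_1..y_n for a single prompt."""
    rng = random.Random(seed)
    texts = list(CANDIDATES)
    weights = list(CANDIDATES.values())
    # Sampling (rather than greedy argmax decoding) is what yields
    # the distinct outputs that humans then compare and rank.
    return [rng.choices(texts, weights=weights, k=1)[0] for _ in range(n)]

outputs = sample_responses(4)
for i, y in enumerate(outputs, start=1):
    print(f"y_{i}: {y}")
```

Because sampling is stochastic, repeated draws can occasionally coincide; real pipelines often combine sampling with prompt or decoding-parameter variation to encourage distinct candidates.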
Tags
- Ch.2 Generative Models - Foundations of Large Language Models
- Foundations of Large Language Models
- Foundations of Large Language Models Course
- Computing Sciences
Related
- Evaluating Strategies for Response Diversity: A research team is collecting data for a human feedback process. They find that their instruction-tuned model, despite sampling, consistently produces outputs that are very similar in structure and content for a given prompt. Which of the following strategies would be most effective at introducing fundamentally different perspectives and conceptual variety into the generated responses?
- Generation of Candidate Outputs from Input-Only Datasets in RLHF: A team is working on collecting a dataset for human feedback and wants to ensure a wide variety of model responses for each user request. Match each technique for increasing output diversity with the scenario that best exemplifies it.
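One common knob behind the diversity techniques mentioned above is the sampling temperature. As a hedged illustration of the mechanics only (the question above concerns conceptual variety, which temperature alone may not deliver), the sketch below shows how raising the temperature flattens a softmax distribution over next tokens, so lower-probability continuations get sampled more often. The logit values are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling: p_i is proportional to exp(logit_i / T)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # invented next-token scores

low_t = softmax(logits, temperature=0.5)
high_t = softmax(logits, temperature=2.0)

# Higher temperature flattens the distribution: the top token
# dominates less, so rarer continuations are sampled more often.
print("T=0.5 top-token mass:", round(low_t[0], 3))
print("T=2.0 top-token mass:", round(high_t[0], 3))
```

In practice, temperature is often combined with top-k or nucleus (top-p) sampling and with prompt rewording when the goal is genuinely different perspectives rather than surface-level variation.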
Learn After
- Evaluating AI-Generated Responses: A key step in gathering data for Reinforcement Learning from Human Feedback (RLHF) is to have a language model generate multiple, varied responses to a single prompt. Which of the following sets of responses to the prompt 'What are the benefits of regular exercise?' best exemplifies the desired diversity and quality for this data collection process?
In a data collection process where a language model generates multiple outputs for a single prompt to be evaluated by humans, the model was given the prompt 'How can I improve my public speaking skills?' and produced the following four responses. What is the primary weakness of this set of responses for its intended purpose?
- Response A: Practice your speech in front of a mirror to get comfortable with the material.
- Response B: Rehearse your presentation multiple times to build confidence.
- Response C: Run through your talk several times before the actual event.
- Response D: Join a local public speaking club to get feedback and practice in a supportive environment.