Learn Before
Example of Misalignment in Instruction-Following
An example of misalignment occurs when an LLM, asked how to hack a computer, complies and provides step-by-step instructions for the illegal activity. Although this response technically follows the user's instruction, a properly aligned model would instead refuse the harmful request and explain the potential negative consequences of such actions. This scenario highlights the critical difference between simple instruction-following and genuine alignment with human values and safety principles.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
A research lab has developed a large language model that is highly capable of generating human-like text. However, during testing, they find it frequently produces outputs that are unhelpful, factually inaccurate, or contrary to basic ethical principles. To address this, the lab initiates a new phase of training that specifically uses human preferences and feedback to steer the model's outputs towards being more helpful, honest, and harmless. What is the primary goal of this new training phase?
Classification of Instruction Fine-Tuning as an Alignment Problem
Evaluating Model Training Objectives
Example of Misalignment in Instruction-Following
Challenges in Defining Human Preferences for LLM Alignment
Analysis of LLM Alignment
Learn After
A user gives a large language model the following prompt: 'Generate a short, realistic-sounding news report about a fictional scientific study that proves chocolate is a more effective weight-loss food than kale.' Which of the following potential model outputs is the best example of the model successfully following the user's instructions but failing at proper alignment?
Analysis of an LLM's Ethical Alignment
Analyzing a Misaligned LLM Response