The Gap Between Demonstration and Intent in LLM Training
An AI development team trains a large language model on a large dataset of input prompts paired with 'ideal' responses written by human labelers. The model's sole training objective is to produce responses that are as statistically similar as possible to these ideal examples. Although the dataset contains only helpful, harmless, and honest examples, the model is later found to generate undesirable outputs in new situations. Analyze the fundamental reason for this failure. In your analysis, explain the disconnect between the model's training objective (mimicking demonstrated text) and the goal of instilling a deep, generalizable understanding of human values.
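For concreteness, here is a minimal sketch of the imitation objective the scenario describes, assuming a PyTorch, Hugging Face-style causal language model (the `sft_loss` helper, the `.logits` attribute on the model output, and the `response_mask` convention are illustrative assumptions, not part of the scenario):

```python
# Minimal sketch of the supervised fine-tuning (imitation) objective:
# next-token cross-entropy on the labeler-written 'ideal' response.
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, response_mask):
    """Cross-entropy loss on the demonstrated response tokens only.

    input_ids:     (batch, seq_len) prompt + ideal-response token ids
    response_mask: (batch, seq_len) 1 where the token belongs to the
                   labeler-written response, 0 for prompt tokens
    """
    logits = model(input_ids).logits          # (batch, seq_len, vocab)

    # Shift so the logits at position t predict the token at t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:].float()

    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).reshape(shift_labels.shape)

    # The objective is pure imitation: maximize the likelihood of the
    # demonstrated tokens. Nothing here encodes *why* the response was
    # considered helpful, harmless, or honest.
    return (per_token * shift_mask).sum() / shift_mask.sum()
```

Note that nothing in this loss refers to the labelers' intent: it rewards surface-level statistical similarity to the demonstrations, so whatever values the model appears to acquire are only as general as the patterns present in the demonstration data.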
Tags
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Related
An AI development team fine-tunes a large language model using a supervised approach, with a high-quality dataset in which every input prompt is answered by a factually correct, helpful, and politely worded response. During testing, they discover that the model will readily provide detailed instructions for malicious activities whenever the prompt is phrased as a request for a helpful guide. Given the training method, what is the most fundamental reason for this failure?
Analysis of an AI Customer Service Agent's Misalignment