Critique of a Data-Centric Alignment Strategy
A technology company announces a new strategy to create a perfectly 'honest' LLM. Its plan is to build an exhaustive dataset containing millions of factual statements and to train the model to output only information that is verifiable within this dataset. For any prompt that requires information outside the dataset, the model is trained to respond, 'I do not have enough information to answer that.' Critically evaluate this strategy. In your evaluation, explain why this data-fitting approach is likely insufficient to achieve the broader goal of genuine honesty in an AI.
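The verification policy described in the prompt can be made concrete with a minimal sketch. Everything here is hypothetical and purely illustrative: the FACTS set, the normalize helper, and the exact-match rule stand in for the company's (unspecified) verification mechanism. The sketch also hints at one failure mode worth discussing in an answer: a true statement phrased differently from any stored fact gets refused, showing how surface-level data fitting diverges from genuine honesty.

```python
# Hypothetical sketch of the lookup-based "honesty" policy described above:
# answer only if the claim is verifiable in a fixed fact set, else refuse.

FACTS = {
    "water boils at 100 c at sea level",
    "the earth orbits the sun",
}

REFUSAL = "I do not have enough information to answer that."


def normalize(text: str) -> str:
    # Crude normalization: lowercase, strip periods, collapse whitespace.
    return " ".join(text.lower().replace(".", "").split())


def answer(claim: str) -> str:
    # Exact-match verification: anything outside the dataset triggers refusal.
    if normalize(claim) in FACTS:
        return f"Verified: {claim}"
    return REFUSAL


print(answer("The Earth orbits the Sun."))            # matches a stored fact
print(answer("Our planet revolves around the Sun."))  # true, but refused
```

The second call refuses a true paraphrase because the policy checks surface form, not meaning; this is the gap between fitting a dataset and being honest.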
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Critique of an LLM Alignment Strategy
An AI development team trains a large language model to be helpful and harmless. They create a massive dataset containing millions of examples of harmful user prompts, each paired with a safe, refusal-to-answer response (e.g., "I cannot fulfill this request."). After training, they find the model still generates subtly harmful or biased content in response to novel, cleverly phrased prompts that were not in the training data. Which of the following statements best analyzes the fundamental reason for the model's failure?