An AI development team trains a large language model to be helpful and harmless. They create a massive dataset containing millions of harmful user prompts, each paired with a safe refusal response (e.g., "I cannot fulfill this request."). After training, they find that the model still generates subtly harmful or biased content in response to novel, cleverly phrased prompts that were not in the training data. Which of the following statements best analyzes the fundamental reason for the model's failure?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Critique of an LLM Alignment Strategy
Critique of a Data-Centric Alignment Strategy