Learn Before
Insufficiency of Data Fitting for Aligning with Human Values
Aligning LLMs with human values requires more than simply fitting the model to a limited dataset of annotated examples. Such datasets are rarely sufficient to capture the full spectrum of desired behaviors. The fundamental goal is not merely to replicate specific annotated outputs, but to instill in the model a general capability to discern which responses better align with human preferences.
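A minimal sketch of this contrast, assuming a PyTorch setting (not part of the original card): sft_loss fits the model to fixed annotated outputs, whereas preference_loss is a Bradley-Terry-style pairwise objective of the kind used to train reward models, which learns a relative preference signal that can judge novel responses rather than only reproduce memorized ones. All function and variable names here are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    # Supervised fine-tuning: push the model toward one fixed annotated
    # response per prompt. The model learns to reproduce the specific
    # outputs it has seen, which may not generalize to novel prompts.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry pairwise objective: learn a *relative* preference
    # signal by scoring a preferred response above a rejected one,
    # so the model can rank responses it has never seen before.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```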
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Learn After
An AI development team trains a large language model to be helpful and harmless. They create a massive dataset containing millions of examples of harmful user prompts, each paired with a safe, refusal-to-answer response (e.g., "I cannot fulfill this request."). After training, they find the model still generates subtly harmful or biased content in response to novel, cleverly phrased prompts that were not in the training data. Which of the following statements best analyzes the fundamental reason for the model's failure?
Critique of an LLM Alignment Strategy
Critique of a Data-Centric Alignment Strategy