Superficial Alignment Hypothesis
The superficial alignment hypothesis holds that an LLM's knowledge and capabilities are acquired almost entirely during pre-training. On this view, fine-tuning adds little new knowledge; instead, it performs a 'superficial' adjustment that aligns the model's existing abilities with specific user needs and instruction formats. This would explain why alignment can be achieved with a relatively small amount of fine-tuning data and effort.
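The intuition can be sketched with a toy numerical analogy (a minimal NumPy illustration, not a real LLM): "pre-training" learns a rich feature representation from abundant data, and "alignment" then fits only a tiny readout head on a small dataset, with the pretrained weights frozen. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-training": learn a feature map W from abundant data (10,000 samples).
d_in, d_feat = 20, 50
X_pre = rng.normal(size=(10_000, d_in))
W_true = rng.normal(size=(d_in, d_feat))
F_pre = X_pre @ W_true                                 # target features
W, *_ = np.linalg.lstsq(X_pre, F_pre, rcond=None)      # "pretrained" weights

# "Alignment": the downstream task is a linear readout of those features.
# Only a small head `a` (50 parameters) is fit, on just 500 examples,
# while the pretrained W stays frozen.
a_true = rng.normal(size=d_feat)
X_ft = rng.normal(size=(500, d_in))
y_ft = X_ft @ W_true @ a_true
feats = X_ft @ W                                       # frozen pretrained features
a, *_ = np.linalg.lstsq(feats, y_ft, rcond=None)

# The tiny fine-tuned head generalizes to unseen inputs: 500 examples
# sufficed because the necessary "capability" was already in W.
X_test = rng.normal(size=(1_000, d_in))
err = np.abs(X_test @ W @ a - X_test @ W_true @ a_true).max()
print(err)   # near machine precision
```

The point of the sketch mirrors the hypothesis: the small fine-tuning set never had enough information to teach the representation itself; it only selected how already-present capabilities are surfaced.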
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Fine-Tuning Pre-trained Models for Downstream Tasks
Instruction Fine-Tuning
Challenge of Opaque Pre-Training Data in Fine-Tuning
A team develops a large language model pre-trained on a massive, diverse corpus of text from the internet. When initially tested on the task of generating concise summaries of legal documents, its performance is poor and unstructured. The team then collects a small, curated dataset of 500 legal documents and their corresponding expert-written summaries. After training the model on this small dataset, its ability to summarize new legal documents improves dramatically. Which statement best analyzes the role of this second training phase?
Critiquing a Model Training Hypothesis
Implicit Learning of Instruction-Response Mappings During Pre-training
Explaining the Impact of Targeted Training
Learn After
Interpreting LLM Training Observations
A research team observes that a large language model, pre-trained on a massive text corpus, requires a surprisingly small dataset of instruction-following examples to become a helpful assistant. According to the Superficial Alignment Hypothesis, what is the most accurate explanation for this observation?
Explaining the Role of Fine-Tuning