Theory

Superficial Alignment Hypothesis

The superficial alignment hypothesis holds that an LLM's knowledge and capabilities are learned almost entirely during pre-training. On this view, fine-tuning adds little new knowledge; it makes a 'superficial' adjustment that aligns the model's existing abilities with specific user needs and instruction formats. If the hypothesis is correct, it would explain why alignment can be achieved with a relatively small amount of fine-tuning data and effort.

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences