
Discarding the Pre-training Head for Downstream Adaptation

After a model's parameters have been optimized during pre-training, the next step in adapting the model is to remove the pre-training-specific output layer, denoted by the parameters $\widehat{\mathbf{W}}$. This layer is discarded because it is tailored exclusively to the pre-training objective and is not applicable to downstream tasks. Dropping it leaves the core pre-trained encoder, parameterized by $\hat{\theta}$, ready to be either further fine-tuned or applied directly as a fixed feature extractor for new tasks.
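To make the step concrete, here is a minimal PyTorch sketch. The model class, layer sizes, and the downstream label count are illustrative assumptions, not taken from the original text; the point is only that the pre-training head (the $\widehat{\mathbf{W}}$ layer) is not carried over, while the encoder ($\hat{\theta}$) is reused either as trainable parameters (fine-tuning) or frozen (fixed feature extractor).

```python
import torch
import torch.nn as nn

class PretrainedModel(nn.Module):
    """Illustrative pre-trained model: an encoder (theta-hat) plus a
    pre-training head (W-hat), e.g. a vocabulary-sized projection used
    only by the pre-training objective. Sizes are assumptions."""
    def __init__(self, hidden_size=768, vocab_size=32000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12,
                                       batch_first=True),
            num_layers=12,
        )
        # Pre-training-specific output layer (W-hat); discarded after pre-training.
        self.pretraining_head = nn.Linear(hidden_size, vocab_size)

pretrained = PretrainedModel()
# ... load the optimized pre-trained parameters here (checkpoint omitted) ...

# Step 1: discard the pre-training head by keeping only the encoder.
encoder = pretrained.encoder

# Option A: fine-tuning -- attach a new task-specific head and update
# both the encoder and the new head on downstream data.
num_labels = 3  # hypothetical downstream task
task_head = nn.Linear(768, num_labels)
finetune_params = list(encoder.parameters()) + list(task_head.parameters())
optimizer = torch.optim.AdamW(finetune_params, lr=2e-5)

# Option B: fixed feature extractor -- freeze the encoder and train
# only the new task head on top of its representations.
for p in encoder.parameters():
    p.requires_grad = False
feature_optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)
```

Either option reuses the same encoder parameters; the only difference is whether they continue to receive gradient updates on the downstream task.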
