
Discarding the Pre-training Head for Downstream Adaptation

After a model's parameters have been optimized during pre-training, the next step in adapting the model is to remove the pre-training-specific output layer, denoted by the parameters $\widehat{\mathbf{W}}$. This layer is discarded because it is tailored exclusively to the pre-training objective and is not applicable to downstream tasks. Dropping it leaves the core pre-trained encoder, parameterized by $\hat{\theta}$, ready to be either further fine-tuned or applied directly as a fixed feature extractor for new tasks.
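To make the step concrete, here is a minimal PyTorch sketch. The model class, layer sizes, and the downstream label count are illustrative assumptions, not taken from the original text; the point is only that the pre-training head (the $\widehat{\mathbf{W}}$ layer) is not carried over, while the encoder ($\hat{\theta}$) is reused either as trainable parameters (fine-tuning) or frozen (fixed feature extractor).

```python
import torch
import torch.nn as nn

class PretrainedModel(nn.Module):
    """Illustrative pre-trained model: an encoder (theta-hat) plus a
    pre-training head (W-hat), e.g. a vocabulary-sized projection used
    only by the pre-training objective. Sizes are assumptions."""
    def __init__(self, hidden_size=768, vocab_size=32000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12,
                                       batch_first=True),
            num_layers=12,
        )
        # Pre-training-specific output layer (W-hat); discarded after pre-training.
        self.pretraining_head = nn.Linear(hidden_size, vocab_size)

pretrained = PretrainedModel()
# ... load the optimized pre-trained parameters here (checkpoint omitted) ...

# Step 1: discard the pre-training head by keeping only the encoder.
encoder = pretrained.encoder

# Option A: fine-tuning -- attach a new task-specific head and update
# both the encoder and the new head on downstream data.
num_labels = 3  # hypothetical downstream task
task_head = nn.Linear(768, num_labels)
finetune_params = list(encoder.parameters()) + list(task_head.parameters())
optimizer = torch.optim.AdamW(finetune_params, lr=2e-5)

# Option B: fixed feature extractor -- freeze the encoder and train
# only the new task head on top of its representations.
for p in encoder.parameters():
    p.requires_grad = False
feature_optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)
```

Either option reuses the same encoder parameters; the only difference is whether they continue to receive gradient updates on the downstream task.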
