Learn Before
Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
Mahabadi, R. K., Ruder, S., Dehghani, M., & Henderson, J. (2021). Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. arXiv preprint arXiv:2106.04489.
Tags
Data Science
Related
Delta Tuning
Instruction Fine-Tuning
Selecting an Efficient Fine-Tuning Strategy
A research lab needs to adapt a single, very large pre-trained language model (100B+ parameters) to 50 highly specialized downstream tasks. The lab's primary constraint is minimizing storage and computational cost: creating and storing 50 fully fine-tuned copies of the model is not feasible. Which fine-tuning strategy would be the most effective solution to this specific problem?
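For reference, below is a minimal sketch (assuming PyTorch) of the kind of parameter-efficient approach the scenario points toward, in the spirit of the cited paper: the large base model stays frozen and stored once, while a small shared hypernetwork generates per-task adapter weights from learned task embeddings. All class names, dimensions, and the simplified adapter placement are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

D_MODEL, D_ADAPTER, D_TASK = 768, 64, 32   # hidden size, adapter bottleneck, task-embedding size (illustrative)
NUM_TASKS = 50                             # e.g. the 50 downstream tasks in the scenario

class AdapterHypernetwork(nn.Module):
    """Generates down- and up-projection adapter weights for a task from its task embedding."""
    def __init__(self):
        super().__init__()
        self.task_embeddings = nn.Embedding(NUM_TASKS, D_TASK)
        # One linear generator per (flattened) weight matrix, shared across all tasks.
        self.gen_down = nn.Linear(D_TASK, D_MODEL * D_ADAPTER)
        self.gen_up = nn.Linear(D_TASK, D_ADAPTER * D_MODEL)

    def forward(self, task_id: torch.Tensor):
        z = self.task_embeddings(task_id)                    # (D_TASK,)
        w_down = self.gen_down(z).view(D_ADAPTER, D_MODEL)   # down-projection weights
        w_up = self.gen_up(z).view(D_MODEL, D_ADAPTER)       # up-projection weights
        return w_down, w_up

def adapter_forward(hidden, w_down, w_up):
    """Bottleneck adapter with a residual connection; the surrounding transformer stays frozen."""
    return hidden + torch.relu(hidden @ w_down.T) @ w_up.T

# Usage: only the shared hypernetwork is trained; the 100B-parameter base
# model would remain frozen and be stored a single time.
hypernet = AdapterHypernetwork()
hidden_states = torch.randn(4, 128, D_MODEL)     # (batch, seq_len, d_model)
w_down, w_up = hypernet(torch.tensor(7))         # adapter weights for task 7
out = adapter_forward(hidden_states, w_down, w_up)
print(out.shape)                                 # torch.Size([4, 128, 768])
trainable = sum(p.numel() for p in hypernet.parameters())
print(f"trainable parameters shared across all tasks: {trainable:,}")

In this setup the per-task storage cost is only a small task embedding (plus the shared hypernetwork), rather than a full fine-tuned copy of the model, which is what makes the strategy attractive under the constraints described above.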
A development team is exploring different methods to adapt a large pre-trained language model for various applications. Match each of the following scenarios with the most appropriate fine-tuning strategy.