Multiple Choice

A team is adapting a language model, originally pre-trained with a standard full attention mechanism, to tasks involving extremely long text sequences. Their strategy is to replace full attention with a more computationally efficient sparse attention mechanism and then fine-tune the model on their long-context dataset. What is the primary reason for initializing this new sparse-attention model with the original model's parameters, rather than starting fine-tuning from randomly initialized parameters?
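The intuition behind the strategy: most sparse-attention variants keep the same parameter shapes as full attention (the Q/K/V and output projections) and only restrict which positions attend to which, so the pretrained weights, which encode the model's general linguistic knowledge, can be copied in directly and fine-tuning only has to adapt the model to the new attention pattern. Below is a minimal PyTorch sketch of that weight transfer; `FullAttentionBlock`, `SparseAttentionBlock`, and the checkpoint path are hypothetical stand-ins, not a specific library's API.

```python
import torch
import torch.nn as nn


class FullAttentionBlock(nn.Module):
    """Stand-in for one pretrained block with standard full attention."""
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)


class SparseAttentionBlock(nn.Module):
    """Same projections, but attention is restricted to a local window."""
    def __init__(self, d_model: int, window: int = 256):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # identical shape to the full-attention block
        self.out = nn.Linear(d_model, d_model)
        self.window = window  # the sparsity pattern itself has no trainable weights


d_model = 512
pretrained = FullAttentionBlock(d_model)
# In practice the pretrained weights would be loaded from disk, e.g.:
# pretrained.load_state_dict(torch.load("checkpoint.pt"))  # hypothetical path

sparse = SparseAttentionBlock(d_model)

# Because the projection shapes match, the pretrained weights drop in
# directly; strict=False tolerates keys that exist on only one side.
missing, unexpected = sparse.load_state_dict(pretrained.state_dict(), strict=False)
print("missing:", missing, "| unexpected:", unexpected)
```

With this initialization, fine-tuning starts from a model that already "knows" the language and only needs to adjust to the restricted attention pattern; starting from random parameters would discard everything learned during pre-training and effectively require training from scratch.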
