Generalization Through Parameter Sharing
A language model that groups query-key offsets into buckets and assigns a single shared learnable parameter to each bucket often performs well on input sequences longer than any it saw during training. Explain the core reason why this parameter-sharing strategy enables the model to effectively handle previously unseen, large offsets.
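To make the bucketing idea concrete, here is a minimal sketch of a T5-style relative-position bucketing function. This is an illustrative simplification (causal, non-negative offsets only); the function name and the constants `num_buckets` and `max_distance` are assumptions, not taken from the question above. The key property it demonstrates is that all sufficiently large offsets collapse into the same top bucket, so the bias parameter for an unseen offset was still trained.

```python
import math

def relative_position_bucket(offset: int, num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a query-key offset to a bucket index (simplified T5-style scheme).

    Small offsets each get their own bucket; larger offsets are grouped
    logarithmically, and anything beyond max_distance is clamped into the
    final bucket.
    """
    offset = max(offset, 0)           # causal attention: non-negative offsets
    max_exact = num_buckets // 2      # first half: one bucket per exact offset
    if offset < max_exact:
        return offset
    # Second half: log-spaced buckets between max_exact and max_distance.
    log_ratio = math.log(offset / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(log_ratio * (num_buckets - max_exact))
    return min(bucket, num_buckets - 1)

# An offset of 700, never seen during training, maps to the same bucket
# as offsets that did appear in 512-token training sequences (e.g. 500),
# so the model reuses a parameter it has already learned.
print(relative_position_bucket(5))    # small offsets keep distinct buckets
print(relative_position_bucket(500))
print(relative_position_bucket(700))
```

Because offsets 500 and 700 share a bucket, the model applies a trained bias to the novel offset; a model with one parameter per individual offset would have no trained parameter for 700 at all.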
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model was trained exclusively on text segments with a maximum length of 512 tokens. During inference, it must process a 1000-token document, encountering a query-key offset of 700 for the first time. Why is an architecture that groups offsets into buckets and shares a single learnable parameter per bucket better equipped to handle this novel offset than a hypothetical model that learns a separate parameter for every individual offset?
Diagnosing Generalization Failure in a Transformer Model