A language model was trained exclusively on text segments with a maximum length of 512 tokens. During inference, it must process a 1000-token document, encountering a query-key offset of 700 for the first time. Why is a model architecture that groups offsets into 'buckets' and shares a single learnable parameter per bucket better equipped to handle this novel offset than a hypothetical model that learns a unique, separate parameter for every individual offset?
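For intuition, here is a minimal, illustrative sketch of log-scaled offset bucketing in the spirit of T5-style relative position biases. It is not the actual implementation from any particular model; the function name `relative_position_bucket` and the hyperparameters `num_buckets=32` and `max_distance=128` are assumptions chosen for the example.

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a query-key offset to a bucket index (illustrative sketch).

    Small offsets each get their own bucket; larger offsets are grouped
    logarithmically, and anything beyond max_distance falls into the
    last bucket. num_buckets and max_distance are assumed values.
    """
    # Use absolute distance for simplicity (real implementations may also
    # distinguish the sign of the offset).
    distance = abs(offset)
    max_exact = num_buckets // 2  # first half of buckets hold exact offsets
    if distance < max_exact:
        return distance
    # Remaining buckets cover [max_exact, max_distance) on a log scale.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(log_ratio * (num_buckets - max_exact))
    return min(bucket, num_buckets - 1)

# During training on segments of at most 512 tokens, the model sees offsets
# up to 511; offset 511 already lands in the final bucket, so the bias
# parameter for that bucket gets trained.
print(relative_position_bucket(511))  # 31 (last bucket)

# At inference, the never-seen offset 700 maps to that same trained bucket,
# so the model reuses an existing learned parameter.
print(relative_position_bucket(700))  # 31 -> same bucket, same learned bias
```

Because offsets 511 and 700 land in the same final bucket, the parameter the model consults at the novel offset was already trained. A hypothetical per-offset table, by contrast, would have no entry (or only an untrained one) for offset 700.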
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Generalization Through Parameter Sharing
Diagnosing Generalization Failure in a Transformer Model