1Cademy - Generalization Advantage of T5 Bias through Parameter Sharing

Learn Before

T5 Bias for Relative Positional Embedding

Concept

Generalization Advantage of T5 Bias through Parameter Sharing

The T5 relative positional bias can generalize to sequences longer than those encountered during training. It does so by sharing one learnable parameter among similar query-key offsets: offsets are grouped into buckets, and every offset in a bucket uses the same bias. Because large offsets are rare in training data, a separate parameter for each one would be poorly trained or never trained at all. With sharing, an offset that never appeared in training falls into a bucket alongside offsets that did, so the model applies an already-learned bias to the novel distance.
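The bucketing idea can be illustrated with a minimal sketch. It assumes a causal (unidirectional) setting with non-negative offsets, 32 buckets, and a maximum distance of 128; the function name `relative_position_bucket` and these default values are illustrative choices for this example, not something specified on this page. Small offsets each get their own bucket, larger offsets share logarithmically widening buckets, and everything at or beyond the maximum distance lands in the last bucket.

```python
import math


def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a non-negative query-key offset to a bucket index.

    Illustrative sketch of T5-style bucketing for the causal case.
    `num_buckets` and `max_distance` are assumed defaults.
    """
    if offset < 0:
        raise ValueError("this sketch only handles offsets >= 0")

    # The first half of the buckets holds one exact offset each.
    max_exact = num_buckets // 2
    if offset < max_exact:
        return offset

    # The remaining buckets grow logarithmically in width up to max_distance.
    scaled = math.log(offset / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(scaled * (num_buckets - max_exact))

    # Any offset past the covered range is clamped into the final bucket.
    return min(bucket, num_buckets - 1)


# One shared bias per bucket: an unseen offset reuses a trained entry.
for offset in (3, 20, 500, 700):
    print(offset, relative_position_bucket(offset))
```

Under these assumptions, an offset of 700 (never seen when training on 512-token segments) maps to the same final bucket as an offset of 500, which could occur during training. The bias the model looks up for 700 is therefore a parameter that has already received gradient updates, whereas a table with one entry per offset would hold an untrained value at index 700.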

Updated 2026-04-24

References

Reference of Foundations of Large Language Models Course

Learn After

A language model was trained exclusively on text segments with a maximum length of 512 tokens. During inference, it must process a 1000-token document, encountering a query-key offset of 700 for the first time. Why is a model architecture that groups offsets into 'buckets' and shares a single learnable parameter per bucket better equipped to handle this novel offset than a hypothetical model that learns a unique, separate parameter for every individual offset?
Generalization Through Parameter Sharing
Diagnosing Generalization Failure in a Transformer Model
