Learn Before
A model needs to represent the relative distance between elements in a long sequence using a limited number of shared parameters (buckets). The model's designers have determined that precise distance is important for nearby elements, but for elements that are far apart, a less precise, general sense of distance is sufficient. Which bucketing strategy best balances parameter efficiency with this modeling requirement?
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Formula for Logarithmic Bucketing in T5 Bias
Final Bucket for Offsets Exceeding dist_max in T5 Bias
Parameter Efficiency for Long-Range Dependencies
In a model that uses logarithmic bucketing for large relative position offsets, it is plausible that the same learned bias parameter would be applied to an offset of 500 as to an offset of 510, while offsets of 10 and 20 would likely receive distinct bias parameters.
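The related items above can be illustrated with a minimal sketch of T5-style bucketing: exact buckets for small offsets, logarithmically spaced buckets for larger ones, and a single final bucket for offsets at or beyond dist_max. The function name `relative_position_bucket`, the restriction to non-negative offsets, and the settings `num_buckets=32` and `dist_max=1024` are assumptions chosen for illustration, not values taken from this page.

```python
import math


def relative_position_bucket(offset, num_buckets=32, dist_max=1024):
    """Map a non-negative relative offset to a bucket index.

    Illustrative sketch only: the name and default values are assumptions.
    """
    if offset < 0:
        raise ValueError("this sketch handles non-negative offsets only")

    # Half of the buckets are reserved for exact, one-per-offset distances.
    num_exact = num_buckets // 2
    if offset < num_exact:
        return offset

    # Everything at or beyond dist_max shares the final bucket.
    if offset >= dist_max:
        return num_buckets - 1

    # The remaining buckets are spaced logarithmically between
    # num_exact and dist_max, so they get wider as the offset grows.
    log_ratio = math.log(offset / num_exact) / math.log(dist_max / num_exact)
    bucket = num_exact + int(log_ratio * (num_buckets - num_exact))
    return min(bucket, num_buckets - 1)


if __name__ == "__main__":
    for offset in (10, 20, 500, 510, 5000):
        print(offset, relative_position_bucket(offset))
```

Under these assumed settings, offsets 500 and 510 fall into the same bucket and would share one learned bias, while offsets 10 and 20 land in different buckets. Any offset of 1024 or more maps to the last bucket.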