Learn Before
Consider a transformer model's attention mechanism that uses a set of 'buckets' to store shared parameters for relative positions. For small, non-negative distances between a query and a key, a direct one-to-one correspondence is used where the bucket index is identical to the distance. Based on this rule, an interaction between a query at position 5 and a key at position 2 would be assigned to bucket index 3.
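The one-to-one rule above can be sketched in Python. The function name, the `num_exact` cutoff, and the error raised for out-of-range distances are illustrative assumptions; T5's full scheme additionally uses log-spaced buckets for larger distances, which is beyond this card's rule:

```python
def relative_position_bucket(query_pos, key_pos, num_exact=16):
    """Map a (query, key) position pair to a relative-position bias bucket.

    For small non-negative distances (query_pos - key_pos), the bucket
    index equals the distance itself. The cutoff `num_exact` and the
    fallback behavior are assumptions for illustration only.
    """
    distance = query_pos - key_pos
    if 0 <= distance < num_exact:
        # One-to-one region: bucket index is identical to the distance.
        return distance
    raise NotImplementedError(
        "larger or negative distances fall outside the one-to-one region"
    )

# Query at position 5, key at position 2: distance 3, so bucket index 3.
print(relative_position_bucket(5, 2))
```

Running the example confirms the card's answer: a distance of 3 maps directly to bucket index 3.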
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Formula for One-to-One Mapping in T5 Bias Bucketing
In a transformer model that uses a relative position bias mechanism, a specific set of initial 'buckets' is used to store shared bias parameters. For small, non-negative relative distances between a query and a key, there is a direct correspondence where the bucket index is identical to the distance. If a query is at position 8 and a key is at position 5, what is the index of the bucket used for their interaction?
In a specific attention mechanism, shared parameters for interactions between tokens are stored in 'buckets' based on the distance between them. For the first several buckets, a simple rule applies: the bucket index is identical to the distance. If the distance between two tokens is 4, the interaction parameter will be retrieved from bucket number ____.