Learn Before
Visual Representation of T5 Bias Application (nb=3, distmax=5)
This image visualizes the application of T5 relative position bias in a causal self-attention setting, using hyperparameters n_b = 3 and dist_max = 5. The grid shows the matrix of query-key dot products (q_i · k_j), where the lower triangular shape enforces causality, meaning a query at position i attends only to keys at positions j ≤ i. The bucketing rules for these hyperparameters are as follows: relative position offsets of 0 and 1 are mapped directly to buckets 0 (bias b_0) and 1 (bias b_1); offsets 2 and 3 are grouped into bucket 2 (bias b_2); and all offsets of 4 and greater are consolidated into bucket 3 (bias b_3). A shared bias parameter from the corresponding bucket is added to each dot product to compute the final attention score.
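The bucketing and bias addition described above can be sketched in Python. The bias values below are placeholders standing in for learned per-bucket parameters, and the function names are illustrative, not from any particular library:

```python
import numpy as np

def bucket(offset):
    # Bucketing rules from the figure: 0 -> 0, 1 -> 1, {2, 3} -> 2, >= 4 -> 3
    if offset <= 1:
        return offset
    if offset <= 3:
        return 2
    return 3

def biased_scores(scores, bias):
    # scores: (L, L) matrix of query-key dot products q_i . k_j
    # bias:   length-4 vector of per-bucket bias parameters (b_0 .. b_3)
    L = scores.shape[0]
    out = np.full((L, L), -np.inf)  # non-causal positions stay masked
    for i in range(L):
        for j in range(i + 1):      # causality: query i attends to keys j <= i
            out[i, j] = scores[i, j] + bias[bucket(i - j)]
    return out
```

Note that the bias depends only on the offset i - j, so every diagonal of the causal score matrix receives the same bias value.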


Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Unified Formula for T5 Bias Bucketing
Example of T5 Bias Bucketing
Visual Representation of T5 Bias Application (nb=3, distmax=5)
A model designer is implementing a mechanism to account for the relative distance between tokens in a sequence. The proposed strategy uses a unique, learnable value for each of the first few relative distances (e.g., 1, 2, 3...), but then groups larger distances into a smaller set of shared values, with the size of these groups increasing as the distance grows. What is the primary trade-off this combined approach is designed to optimize?
Analysis of a Hybrid Positional Bucketing System
Formula for Applying T5 Relative Position Bias
Generalization Advantage of T5 Positional Bias
A model uses a hybrid strategy to handle relative positional distances between tokens, assigning each distance to one of a limited number of 'buckets'. The rules are:
- For small distances (e.g., 0-15), each distance is assigned to its own unique bucket.
- For medium distances, the ranges of distances assigned to a single bucket grow progressively larger as the distance increases.
- For very large distances (e.g., beyond 512), all are assigned to a single, final bucket.
Based on this system, which of the following distances is most likely to be assigned to the same bucket as the distance 40?
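A scheme matching this description can be sketched with T5-style logarithmic bucketing. The parameter values below (num_buckets = 32, max_distance = 512, with the first 16 distances getting unique buckets) are assumptions chosen to fit the ranges in the question, not values stated in it:

```python
import math

def relative_position_bucket(distance, num_buckets=32, max_distance=512):
    # T5-style bucketing for a causal setting (distance = query index - key index >= 0).
    max_exact = num_buckets // 2  # small distances (0..15) each get a unique bucket
    if distance < max_exact:
        return distance
    # Medium distances: log-spaced buckets, so bucket ranges widen as distance grows.
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    # Very large distances (>= max_distance) all collapse into the final bucket.
    return min(bucket, num_buckets - 1)
```

With these assumed parameters, distances near 40 fall into progressively wider shared buckets, illustrating the trade-off: fine resolution where relative position matters most, with a bounded parameter count overall.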
Learn After
In a causal self-attention mechanism, a relative position bias is added to the dot product of each query-key pair. The bias is determined by bucketing the relative position offset, which is calculated as (query index - key index). Given the following bucketing rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
Match each query-key pair below to the correct bias bucket that would be applied.
A causal self-attention mechanism uses a relative position bias. The bias is determined by bucketing the relative position offset (query index - key index) according to these rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
The following grid shows the calculated bias bucket index for each query-key pair in a sequence. One of the bucket indices is incorrect. Identify the query-key pair with the incorrectly calculated bias bucket.
              Key Index
           0  1  2  3  4  5
         +-------------------
 Query 0 | 0  X  X  X  X  X
       1 | 1  0  X  X  X  X
       2 | 2  1  0  X  X  X
       3 | 2  2  1  0  X  X
       4 | 3  1  2  1  0  X
       5 | 3  3  2  2  1  0
In a causal self-attention mechanism, a relative position bias is added to the dot product of each query-key pair. The bias is determined by bucketing the relative position offset (query index - key index) according to these rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
Which of the following statements accurately describes a structural property of the resulting bias matrix?