Learn Before
Visual Representation of T5 Bias Application (nb=3, distmax=5)
This image visualizes the application of T5 relative position bias in a causal self-attention setting, using hyperparameters n_b = 3 and dist_max = 5. The grid shows the matrix of query-key dot products (q_i · k_j), where the lower triangular shape enforces causality, meaning a query at position i attends only to keys at positions j ≤ i. The bucketing rules for these hyperparameters are as follows: relative position offsets of 0 and 1 are mapped directly to buckets 0 (bias b_0) and 1 (bias b_1); offsets 2 and 3 are grouped into bucket 2 (bias b_2); and all offsets of 4 and greater are consolidated into bucket 3 (bias b_3). A shared bias parameter from the corresponding bucket is added to each dot product to compute the final attention score.
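The bucketing and bias addition described above can be sketched in Python. The bias values below are placeholders standing in for learned per-bucket parameters, and the function names are illustrative, not from any particular library:

```python
import numpy as np

def bucket(offset):
    # Bucketing rules from the figure: 0 -> 0, 1 -> 1, {2, 3} -> 2, >= 4 -> 3
    if offset <= 1:
        return offset
    if offset <= 3:
        return 2
    return 3

def biased_scores(scores, bias):
    # scores: (L, L) matrix of query-key dot products q_i . k_j
    # bias:   length-4 vector of per-bucket bias parameters (b_0 .. b_3)
    L = scores.shape[0]
    out = np.full((L, L), -np.inf)  # non-causal positions stay masked
    for i in range(L):
        for j in range(i + 1):      # causality: query i attends to keys j <= i
            out[i, j] = scores[i, j] + bias[bucket(i - j)]
    return out
```

Note that the bias depends only on the offset i - j, so every diagonal of the causal score matrix receives the same bias value.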


Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Unified Formula for T5 Bias Bucketing
Example of T5 Bias Bucketing
Visual Representation of T5 Bias Application (nb=3, distmax=5)
A model designer is implementing a mechanism to account for the relative distance between tokens in a sequence. The proposed strategy uses a unique, learnable value for each of the first few relative distances (e.g., 1, 2, 3...), but then groups larger distances into a smaller set of shared values, with the size of these groups increasing as the distance grows. What is the primary trade-off this combined approach is designed to optimize?
Analysis of a Hybrid Positional Bucketing System
Formula for Applying T5 Relative Position Bias
Generalization Advantage of T5 Positional Bias
A model uses a hybrid strategy to handle relative positional distances between tokens, assigning each distance to one of a limited number of 'buckets'. The rules are:
- For small distances (e.g., 0-15), each distance is assigned to its own unique bucket.
- For medium distances, the ranges of distances assigned to a single bucket grow progressively larger as the distance increases.
- For very large distances (e.g., beyond 512), all are assigned to a single, final bucket.
Based on this system, which of the following distances is most likely to be assigned to the same bucket as the distance 40?
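A scheme matching this description can be sketched with T5-style logarithmic bucketing. The parameter values below (num_buckets = 32, max_distance = 512, with the first 16 distances getting unique buckets) are assumptions chosen to fit the ranges in the question, not values stated in it:

```python
import math

def relative_position_bucket(distance, num_buckets=32, max_distance=512):
    # T5-style bucketing for a causal setting (distance = query index - key index >= 0).
    max_exact = num_buckets // 2  # small distances (0..15) each get a unique bucket
    if distance < max_exact:
        return distance
    # Medium distances: log-spaced buckets, so bucket ranges widen as distance grows.
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    # Very large distances (>= max_distance) all collapse into the final bucket.
    return min(bucket, num_buckets - 1)
```

With these assumed parameters, distances near 40 fall into progressively wider shared buckets, illustrating the trade-off: fine resolution where relative position matters most, with a bounded parameter count overall.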
Learn After
In a causal self-attention mechanism, a relative position bias is added to the dot product of each query-key pair. The bias is determined by bucketing the relative position offset, which is calculated as (query index - key index). Given the following bucketing rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
Match each query-key pair below to the correct bias bucket that would be applied.
A causal self-attention mechanism uses a relative position bias. The bias is determined by bucketing the relative position offset (query index - key index) according to these rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
The following grid shows the calculated bias bucket index for each query-key pair in a sequence. One of the bucket indices is incorrect. Identify the query-key pair with the incorrectly calculated bias bucket.
              Key Index
           0  1  2  3  4  5
         +-------------------
 Query 0 | 0  X  X  X  X  X
       1 | 1  0  X  X  X  X
       2 | 2  1  0  X  X  X
       3 | 2  2  1  0  X  X
       4 | 3  1  2  1  0  X
       5 | 3  3  2  2  1  0
In a causal self-attention mechanism, a relative position bias is added to the dot product of each query-key pair. The bias is determined by bucketing the relative position offset (query index - key index) according to these rules:
- Offset 0 → Bucket 0
- Offset 1 → Bucket 1
- Offsets 2 or 3 → Bucket 2
- Offsets 4 or greater → Bucket 3
Which of the following statements accurately describes a structural property of the resulting bias matrix?