Example

Visual Representation of T5 Bias Application ($n_b = 3$, $\text{dist}_{\text{max}} = 5$)

This image visualizes the application of the T5 relative position bias in a causal self-attention setting, using the hyperparameters $n_b = 3$ and $\text{dist}_{\text{max}} = 5$. The grid shows the matrix of query-key dot products $q_i k_j^T$, whose lower-triangular shape enforces causality: a query at position $i$ attends only to keys at positions $j \le i$. The bucketing rules for these hyperparameters are as follows: relative position offsets ($i - j$) of 0 and 1 are mapped directly to buckets 0 ($u_0$) and 1 ($u_1$); offsets 2 and 3 are grouped into bucket 2 ($u_2$); and all offsets of 4 and greater are consolidated into bucket 3 ($u_3$). The shared bias parameter of the corresponding bucket is added to each dot product to compute the final attention score.

Image 0
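
The bucketing rules described above can be sketched in a few lines of Python. This is a minimal illustration that hard-codes the mapping stated in the text for $n_b = 3$ and $\text{dist}_{\text{max}} = 5$; it is not a general T5 implementation. The names `bucket` and `apply_bias`, the sequence length of 6, and the use of negative infinity for masked (non-causal) entries are assumptions made for illustration only.

```python
import math


def bucket(offset: int) -> int:
    """Map a relative offset i - j (>= 0) to a bucket index.

    Hard-codes the rules stated in the text for n_b = 3, dist_max = 5:
    offsets 0 and 1 -> buckets 0 and 1, offsets 2 and 3 -> bucket 2,
    offsets 4 and greater -> bucket 3.
    """
    if offset < 0:
        raise ValueError("causal attention only uses offsets i - j >= 0")
    if offset <= 1:
        return offset
    if offset <= 3:
        return 2
    return 3


def apply_bias(scores, u):
    """Add the shared bias u[bucket(i - j)] to each causal dot product.

    Entries with j > i are set to -inf (an assumed masking convention).
    """
    n = len(scores)
    out = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            out[i][j] = scores[i][j] + u[bucket(i - j)]
    return out


# Bucket index for every (query i, key j) pair; None marks masked entries.
seq_len = 6  # illustrative length, not taken from the source
bucket_grid = [
    [bucket(i - j) if j <= i else None for j in range(seq_len)]
    for i in range(seq_len)
]

for row in bucket_grid:
    print(" ".join("." if b is None else f"u{b}" for b in row))
# u0 . . . . .
# u1 u0 . . . .
# u2 u1 u0 . . .
# u2 u2 u1 u0 . .
# u3 u2 u2 u1 u0 .
# u3 u3 u2 u2 u1 u0
```

Each diagonal of the printed grid shares one bias parameter, and the farthest offsets all collapse into $u_3$, which matches the lower-triangular pattern described in the paragraph above.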


Updated 2025-10-10


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
