Learn Before
Formula for Attention with T5 Bias (Unscaled)
In the T5 model, the attention score calculation deviates from standard scaled dot-product attention by omitting the rescaling operation. Substituting the relative position encoding into the base attention formula, a shared, learnable scalar bias u_{b(i-j)} is added directly to the unscaled dot product of the query vector q_i and key vector k_j. The resulting formula for the attention weight is:

α_{ij} = Softmax_j( q_i k_j^T + u_{b(i-j)} )

This modification, as specified by Raffel et al. (2020), removes the division by the square root of the key dimension, √d_k.
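The formula above can be sketched in NumPy. This is a minimal illustration, not T5's actual implementation: the function name is invented, and the bias here uses simple clipped offsets rather than T5's logarithmic bucketing scheme.

```python
import numpy as np

def t5_attention_weights(Q, K, bias):
    """Attention weights with a T5-style additive relative bias."""
    # Unscaled scores: note there is NO division by sqrt(d_k)
    scores = Q @ K.T + bias                      # shape (n_q, n_k)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens, d_k = 8, one learnable scalar per clipped offset
n, d_k, max_offset = 4, 8, 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
u = rng.normal(size=2 * max_offset + 1)          # shared bias parameters
offsets = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :],
                  -max_offset, max_offset)       # relative positions i - j
bias = u[offsets + max_offset]                   # (n, n) bias matrix
W = t5_attention_weights(Q, K, bias)             # each row sums to 1
```

Because the bias depends only on the clipped offset i - j, distant token pairs beyond max_offset share a single parameter, which is what lets the scheme extrapolate to unseen distances.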

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Offset Calculation for T5 Bias
Number of Buckets for T5 Bias Terms
Learned Parameters for T5 Bias
Generalization Advantage of T5 Bias through Parameter Sharing
Controlling Overfitting with T5 Bias Buckets
Formula for Attention with T5 Bias (Unscaled)
Consider a hypothetical self-attention model that uses a relative positional encoding scheme where every unique query-key offset (e.g., -5, -4, ..., 0, ..., 4, 5) is assigned its own distinct, learnable bias parameter. How does the T5 approach, which groups many different offsets into a limited number of 'buckets' that share a single parameter, represent a key improvement over this hypothetical scheme, especially for handling sequences longer than those seen during training?
Generalization of Relative Positional Bias
Choosing a Positional Encoding Scheme for Generalization
You are reviewing a proposal to extend a productio...
You’re debugging a long-context retrofit of a pret...
Your team is extending a pretrained Transformer fr...
Choosing and Justifying a Positional Retrofit Under Long-Context and Latency Constraints
Selecting a Positional Strategy for a Long-Context Retrofit
Diagnosing Long-Context Failures Across Positional Schemes
You’re reviewing three proposed positional mechani...
Long-Context Retrofit Decision: RoPE Base Scaling vs ALiBi vs T5 Relative Bias
Root-Cause Analysis of Long-Context Degradation After a Positional-Encoding Retrofit
Post-Retrofit Regression: Separating Positional-Method Effects from Scaling Choices
Learn After
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
Analysis of T5 Attention Formula Modifications
Evaluating the Design of T5's Unscaled Attention Mechanism