Learn Before
Analysis of T5 Attention Formula Modifications
A standard attention mechanism calculates the unnormalized score between a query q_i and a key k_j using the expression (q_i ⋅ k_j) / sqrt(d_k), where d_k is the dimension of the key vectors. In contrast, the T5 model's approach uses the expression q_i ⋅ k_j + u_{b(i-j)}, where u is a learnable scalar bias dependent on the relative positions of i and j. Identify the two primary modifications in the T5 expression compared to the standard one, and explain the functional role of each change within the attention mechanism.
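The contrast between the two score formulas can be sketched in a few lines. This is a minimal illustration, not T5's actual implementation: the `bias_table` dict and the identity bucketing b(i-j) = i-j are simplifying assumptions (T5 in practice maps relative distances into a fixed number of buckets).

```python
import numpy as np

def standard_score(q, k, d_k):
    # Standard scaled dot-product score: (q . k) / sqrt(d_k).
    # The sqrt(d_k) divisor keeps score magnitudes stable as
    # dimensionality grows, preventing softmax saturation.
    return np.dot(q, k) / np.sqrt(d_k)

def t5_score(q, k, bias_table, i, j):
    # T5-style score: q . k + u_{b(i-j)}.
    # Two changes vs. the standard form:
    #   1. no 1/sqrt(d_k) scaling;
    #   2. a learnable scalar bias added, indexed by the relative
    #      position (i - j), injecting positional information
    #      directly into the attention logits.
    # bias_table is an illustrative stand-in for the learned bias
    # lookup, here keyed by the raw offset i - j (identity bucketing).
    return np.dot(q, k) + bias_table[i - j]

q = np.array([1.0, 0.0])
k = np.array([1.0, 0.0])
print(standard_score(q, k, d_k=2))           # q.k scaled down by sqrt(2)
print(t5_score(q, k, {0: 0.5}, i=3, j=3))    # q.k plus the bias for offset 0
```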
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
Analysis of T5 Attention Formula Modifications
Evaluating the Design of T5's Unscaled Attention Mechanism