Learn Before
Evaluating the Design of T5's Unscaled Attention Mechanism
The standard scaled dot-product attention formula includes a scaling factor of 1/sqrt(d_k) applied to the query-key dot product, which keeps dot-product magnitudes from growing with the key dimension and saturating the softmax. The T5 model's attention mechanism, however, omits this scaling factor. Evaluate the potential motivations and consequences of this design choice. In your answer, discuss how the absence of this scaling factor might impact model training stability and performance, and how other components of the T5 architecture, such as the learnable relative position bias, might interact with or compensate for this change.
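For concreteness, here is a minimal NumPy sketch contrasting the two designs. This is not T5's actual implementation: the function names, the unit-variance random inputs, and the zero-valued stand-in for the learned bias are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_logits(Q, K):
    """Standard attention logits: (Q K^T) / sqrt(d_k)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def t5_style_logits(Q, K, rel_bias):
    """T5-style logits: Q K^T + b, with no 1/sqrt(d_k) scaling.
    rel_bias[i, j] would hold a learned scalar for relative position i - j."""
    return Q @ K.T + rel_bias

rng = np.random.default_rng(0)
n, d_k = 4, 64
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
bias = np.zeros((n, n))  # stand-in for the learned relative position bias

# With unit-variance inputs, unscaled logits have standard deviation ~sqrt(d_k)
# instead of ~1, so the softmax saturates toward near-one-hot weights and the
# gradients flowing through it shrink.
print(softmax(scaled_logits(Q, K)).max(axis=-1))          # moderate peak weights
print(softmax(t5_style_logits(Q, K, bias)).max(axis=-1))  # peaks near 1.0
```

A common account of why this still trains stably is that T5 effectively folds the omitted 1/sqrt(d_k) factor into its weight initialization, and the relative position bias, being learned jointly with the logits, can adapt to whatever logit scale emerges during training.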
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
Analysis of T5 Attention Formula Modifications
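For reference, in the T5 paper's formulation (which the related question above appears to target), the bias is added to the raw dot product with no explicit 1/sqrt(d_k) factor:

score(i, j) = q_i · k_j + u_{b(i-j)}

whereas a design that retains the standard scaling would instead compute (q_i · k_j)/sqrt(d_k) + u_{b(i-j)}.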