Learn Before
Evaluating the Design of T5's Unscaled Attention Mechanism
The standard scaled dot-product attention formula includes a scaling factor of 1/sqrt(d_k) applied to the query-key dot product, which keeps dot-product magnitudes from growing with the key dimension and saturating the softmax. The T5 model's attention mechanism, however, omits this scaling factor. Evaluate the potential motivations and consequences of this design choice. In your answer, discuss how the absence of this scaling factor might impact model training stability and performance, and how other components of the T5 architecture, such as the learnable relative position bias, might interact with or compensate for this change.
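For concreteness, here is a minimal NumPy sketch contrasting the two designs. This is not T5's actual implementation: the function names, the unit-variance random inputs, and the zero-valued stand-in for the learned bias are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_logits(Q, K):
    """Standard attention logits: (Q K^T) / sqrt(d_k)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def t5_style_logits(Q, K, rel_bias):
    """T5-style logits: Q K^T + b, with no 1/sqrt(d_k) scaling.
    rel_bias[i, j] would hold a learned scalar for relative position i - j."""
    return Q @ K.T + rel_bias

rng = np.random.default_rng(0)
n, d_k = 4, 64
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
bias = np.zeros((n, n))  # stand-in for the learned relative position bias

# With unit-variance inputs, unscaled logits have standard deviation ~sqrt(d_k)
# instead of ~1, so the softmax saturates toward near-one-hot weights and the
# gradients flowing through it shrink.
print(softmax(scaled_logits(Q, K)).max(axis=-1))          # moderate peak weights
print(softmax(t5_style_logits(Q, K, bias)).max(axis=-1))  # peaks near 1.0
```

A common account of why this still trains stably is that T5 effectively folds the omitted 1/sqrt(d_k) factor into its weight initialization, and the relative position bias, being learned jointly with the logits, can adapt to whatever logit scale emerges during training.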
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
Analysis of T5 Attention Formula Modifications
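For reference, in the T5 paper's formulation (which the related question above appears to target), the bias is added to the raw dot product with no explicit 1/sqrt(d_k) factor:

score(i, j) = q_i · k_j + u_{b(i-j)}

whereas a design that retains the standard scaling would instead compute (q_i · k_j)/sqrt(d_k) + u_{b(i-j)}.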