Essay

Evaluating the Design of T5's Unscaled Attention Mechanism

The standard scaled dot-product attention formula includes a scaling factor of 1/sqrt(d_k), applied to the query-key dot product, because with roughly unit-variance components the dot product's standard deviation grows as sqrt(d_k), which would otherwise push the softmax into saturated, low-gradient regions. The T5 model's attention mechanism, however, omits this scaling factor. Evaluate the potential motivations and consequences of this design choice. In your answer, discuss how the absence of this scaling factor might affect training stability and performance, and how other components of the T5 architecture, such as the learnable relative position bias, might interact with or compensate for this change.
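For concreteness, here is a minimal numpy sketch contrasting the two formulations the prompt refers to. It is an illustration of the formulas, not T5's actual implementation: the `rel_bias` tensor is a random placeholder standing in for T5's learned, bucketed relative position bias, and the function names are invented for this example.

```python
# Minimal sketch: standard scaled attention vs. a T5-style unscaled
# variant with an additive relative position bias on the logits.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    return softmax(logits) @ V, logits

def t5_style_attention(Q, K, V, rel_bias):
    """T5-style attention: no 1/sqrt(d_k) scaling; a learned per-(i, j)
    bias is added to the raw logits before the softmax."""
    logits = Q @ K.T + rel_bias
    return softmax(logits) @ V, logits

seq_len, d_k = 8, 64
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
# Random placeholder; in T5 this bias is a learned parameter shared
# across layers and looked up by bucketed relative distance.
rel_bias = 0.1 * rng.standard_normal((seq_len, seq_len))

_, scaled_logits = scaled_attention(Q, K, V)
_, unscaled_logits = t5_style_attention(Q, K, V, rel_bias)

# With zero-mean unit-variance components, q.k has variance ~d_k, so the
# unscaled logits are ~sqrt(d_k) times larger in magnitude.
print("scaled logit std:  ", scaled_logits.std())    # ~1
print("unscaled logit std:", unscaled_logits.std())  # ~sqrt(64) = 8
```

Running the sketch shows the unscaled logits with a standard deviation roughly sqrt(d_k) times larger than the scaled ones; that gap, and how initialization or the learned bias might absorb it, is the trade-off the prompt asks you to analyze.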

Updated 2025-10-08

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science