Learn Before
Formula for Attention with T5 Bias (Unscaled)
In the T5 model, the attention score calculation deviates from standard scaled dot-product attention by omitting the rescaling operation. Substituting the relative position encoding into the base attention formula, a shared, learnable scalar bias u_{b(i-j)} is added directly to the unscaled dot product of the query vector q_i and key vector k_j. The resulting formula for the attention weight is:

α_{ij} = Softmax_j( q_i k_j^T + u_{b(i-j)} )

This modification, as specified by Raffel et al. (2020), removes the division by the square root of the key dimension, √d_k.
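The formula above can be sketched in NumPy. This is a minimal illustration, not T5's actual implementation: the function name is invented, and the bias here uses simple clipped offsets rather than T5's logarithmic bucketing scheme.

```python
import numpy as np

def t5_attention_weights(Q, K, bias):
    """Attention weights with a T5-style additive relative bias."""
    # Unscaled scores: note there is NO division by sqrt(d_k)
    scores = Q @ K.T + bias                      # shape (n_q, n_k)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens, d_k = 8, one learnable scalar per clipped offset
n, d_k, max_offset = 4, 8, 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
u = rng.normal(size=2 * max_offset + 1)          # shared bias parameters
offsets = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :],
                  -max_offset, max_offset)       # relative positions i - j
bias = u[offsets + max_offset]                   # (n, n) bias matrix
W = t5_attention_weights(Q, K, bias)             # each row sums to 1
```

Because the bias depends only on the clipped offset i - j, distant token pairs beyond max_offset share a single parameter, which is what lets the scheme extrapolate to unseen distances.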

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Offset Calculation for T5 Bias
Number of Buckets for T5 Bias Terms
Learned Parameters for T5 Bias
Generalization Advantage of T5 Bias through Parameter Sharing
Controlling Overfitting with T5 Bias Buckets
Formula for Attention with T5 Bias (Unscaled)
Consider a hypothetical self-attention model that uses a relative positional encoding scheme where every unique query-key offset (e.g., -5, -4, ..., 0, ..., 4, 5) is assigned its own distinct, learnable bias parameter. How does the T5 approach, which groups many different offsets into a limited number of 'buckets' that share a single parameter, represent a key improvement over this hypothetical scheme, especially for handling sequences longer than those seen during training?
Generalization of Relative Positional Bias
Choosing a Positional Encoding Scheme for Generalization
You are reviewing a proposal to extend a productio...
You’re debugging a long-context retrofit of a pret...
Your team is extending a pretrained Transformer fr...
Choosing and Justifying a Positional Retrofit Under Long-Context and Latency Constraints
Selecting a Positional Strategy for a Long-Context Retrofit
Diagnosing Long-Context Failures Across Positional Schemes
You’re reviewing three proposed positional mechani...
Long-Context Retrofit Decision: RoPE Base Scaling vs ALiBi vs T5 Relative Bias
Root-Cause Analysis of Long-Context Degradation After a Positional-Encoding Retrofit
Post-Retrofit Regression: Separating Positional-Method Effects from Scaling Choices
Learn After
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
Analysis of T5 Attention Formula Modifications
Evaluating the Design of T5's Unscaled Attention Mechanism