Formula

Formula for Attention with T5 Bias (Unscaled)

In the T5 model, the attention score calculation deviates from the standard scaled dot-product attention by omitting the rescaling operation. Substituting the relative position encoding $\mathrm{PE}(i,j) = u_{b(i-j)}$ into the base attention formula adds a shared, learnable scalar bias $u_{b(i-j)}$ directly to the unscaled query-key dot product of the vectors $\mathbf{q}_i$ and $\mathbf{k}_j$. The resulting formula for the attention weight is:

$$\alpha(i, j) = \mathrm{Softmax}\left(\mathbf{q}_i \mathbf{k}_j^{\mathrm{T}} + u_{b(i-j)} + \mathrm{Mask}(i, j)\right)$$

This modification, as specified by Raffel et al. (2020), removes the division by the square root of the key dimension, $\sqrt{d}$.
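As a concrete illustration, the following PyTorch-style sketch computes these attention weights for a single head under stated assumptions: the bucket function $b(\cdot)$ is simplified to a clamped relative distance rather than T5's actual log-spaced bucketing, and the names (`t5_bias_attention`, `relative_bias`, `bias_table`, `max_distance`) are illustrative, not taken from the source.

```python
import torch
import torch.nn.functional as F

def relative_bias(seq_len, bias_table, max_distance):
    # PE(i, j) = u_{b(i - j)}: map each relative offset i - j to a bucket
    # index b(i - j) and look up a learnable scalar. Here b(.) is a simple
    # clamp; T5's real bucketing is log-spaced for larger distances.
    i = torch.arange(seq_len).unsqueeze(1)   # [seq_len, 1]
    j = torch.arange(seq_len).unsqueeze(0)   # [1, seq_len]
    bucket = (i - j).clamp(-max_distance, max_distance) + max_distance
    return bias_table[bucket]                # [seq_len, seq_len]

def t5_bias_attention(q, k, v, bias_table, causal_mask, max_distance=8):
    # Unscaled dot product: no division by sqrt(d) (Raffel et al., 2020).
    scores = q @ k.transpose(-1, -2)         # q_i . k_j^T, [seq_len, seq_len]
    scores = scores + relative_bias(q.size(-2), bias_table, max_distance)
    scores = scores + causal_mask            # additive mask: 0 or -inf
    alpha = F.softmax(scores, dim=-1)        # attention weights alpha(i, j)
    return alpha @ v

# Example usage (seq_len = 4, d = 16, hypothetical shapes):
# q = k = v = torch.randn(4, 16)
# bias_table = torch.nn.Parameter(torch.zeros(2 * 8 + 1))
# causal_mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)
# out = t5_bias_attention(q, k, v, bias_table, causal_mask)
```

The bias table holds one scalar per bucket, so the same parameter $u_{b(i-j)}$ is reused for every query-key pair at the same (bucketed) relative distance, which is what lets the bias generalize across positions.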
