Learn Before
Attention Score in Transformers (β_{i,j})
The attention score, denoted β_{i,j}, is the intermediate value computed between a query vector and a key vector before any normalization is applied. This score calculation involves a scaled dot product with an optional masking variable, defined by the formula:

β_{i,j} = (q_i · k_j) / √(d_k) + Mask(i, j)

In this equation, d_k represents the dimension of the key vectors, and Mask(i, j) is the masking variable for the pair (i, j), utilized to optionally block certain positions from attending to others: it is 0 for allowed positions and −∞ for blocked ones.
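As a concrete illustration, the minimal NumPy sketch below computes this score for a single query/key pair; the function name attention_score and the mask_value argument are illustrative choices, not part of any particular library.

```python
# Minimal sketch of the attention-score formula above, using NumPy.
# attention_score and mask_value are illustrative names, not library API.
import numpy as np

def attention_score(q, k, mask_value=0.0):
    """Compute beta_{i,j} = (q . k) / sqrt(d_k) + Mask(i, j) for one
    query/key pair. mask_value is 0.0 for visible positions and -inf
    for blocked ones."""
    d_k = k.shape[-1]                       # dimension of the key vector
    return (q @ k) / np.sqrt(d_k) + mask_value
```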

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Analyzing Training Instability in an Attention Mechanism

An engineer is designing a self-attention layer for a text processing model. They notice that as they increase the dimensionality (d_k) of the query and key vectors, the training process becomes unstable, and the gradients used for learning become extremely small. Which of the following best explains this phenomenon and the standard solution implemented within the attention mechanism?
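For intuition about this question: with vector entries of roughly unit variance, the raw dot product of two d_k-dimensional vectors has variance about d_k, so its typical magnitude grows with dimension and pushes the softmax into a saturated, small-gradient regime; dividing by √(d_k) keeps the score variance near 1. A short NumPy sketch of the effect (sample size and seed are arbitrary):

```python
# With i.i.d. N(0, 1) entries, a raw dot product q . k has variance d_k,
# so its standard deviation grows like sqrt(d_k); dividing by sqrt(d_k)
# keeps the score scale roughly constant across dimensions.
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256, 4096):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = np.sum(q * k, axis=1)             # unscaled dot products
    scaled = raw / np.sqrt(d_k)             # scaled dot products
    print(d_k, raw.std(), scaled.std())     # raw std ~ sqrt(d_k), scaled ~ 1
```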
A transformer's self-attention layer calculates an output vector for each input token. Arrange the following computational steps in the correct sequence to produce a single output vector, based on its query vector and the full set of key and value vectors for the input sequence.
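As a reference for the ordering this question asks about, here is a minimal NumPy sketch of the full computation; the names, shapes, and optional mask handling are illustrative assumptions:

```python
# Sketch of the step sequence: dot products -> scaling -> optional
# masking -> softmax -> weighted sum of value vectors.
import numpy as np

def attention_output(q, K, V, mask=None):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values;
    mask: optional (n,) vector of 0.0 / -inf values."""
    d_k = K.shape[-1]
    scores = (K @ q) / np.sqrt(d_k)         # steps 1-2: scaled dot products (beta)
    if mask is not None:
        scores = scores + mask              # step 3: optional masking
    weights = np.exp(scores - scores.max()) # step 4: softmax -> weights (alpha)
    weights /= weights.sum()
    return weights @ V                      # step 5: weighted sum of values
```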
Learn After
Calculating Attention Weights (α_{i,j}) in Transformers
Relative Positional Encoding as a Query-Key Bias
Calculating a Scaled Attention Score

In a sequence processing model, an intermediate score is calculated to determine the relationship between two elements. This score is found by taking the dot product of a 'query' vector and a 'key' vector, and then scaling the result by dividing by the square root of the vectors' dimension. Assume no other adjustments are made to the score.
Given the following information:
- Query vector: [2.0, 0.5, 1.0, -1.5]
- Key vector: [1.0, 1.0, -0.5, 2.0]
- Vector dimension: 4

What is the calculated intermediate score?
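One way to verify the arithmetic with a few lines of NumPy:

```python
# The dot product is 2.0*1.0 + 0.5*1.0 + 1.0*(-0.5) + (-1.5)*2.0 = -1.0,
# and dividing by sqrt(4) = 2 gives a score of -0.5.
import numpy as np

q = np.array([2.0, 0.5, 1.0, -1.5])
k = np.array([1.0, 1.0, -0.5, 2.0])
score = (q @ k) / np.sqrt(len(q))
print(score)   # -0.5
```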
In a transformer model designed for text generation, a masking mechanism is applied to the attention scores (β_{i,j}) to prevent a token at position i from attending to future tokens (positions j > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?
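For reference, a small NumPy sketch that builds the 4×4 causal mask pattern this question describes (the use of np.tril here is an illustrative choice):

```python
# Position i may attend to positions j <= i, so entries with j > i
# (above the diagonal) are set to -inf before the softmax.
import numpy as np

n = 4
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```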
ifrom attending to future tokens (positionsj > i). This is achieved by adding a large negative number (e.g., -∞) to the score before normalization. Consider the calculation of attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?Analyzing Training Instability in an Attention Mechanism