Formula

Attention Score in Transformers ($\beta_{i,j}$)

The attention score, denoted $\beta_{i,j}$, is the intermediate value computed between a query vector $\mathbf{q}_i$ and a key vector $\mathbf{k}_j$ before the softmax normalization is applied. It is a scaled dot product plus an optional additive masking term, defined by the formula:

$$\beta_{i,j} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d}} + \mathrm{Mask}(i,j)$$

In this equation, $d$ represents the dimension of the key vectors, and $\mathrm{Mask}(i,j)$ is the masking variable for the position pair $(i,j)$, used to optionally block certain positions from attending to others.
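As a concrete illustration, here is a minimal NumPy sketch of this computation. The function name `attention_scores` and the causal-mask construction are illustrative assumptions, not taken from the text:

```python
import numpy as np

def attention_scores(Q, K, mask=None):
    """Compute pre-softmax attention scores beta[i, j].

    Q: (n_q, d) array of query vectors q_i.
    K: (n_k, d) array of key vectors k_j.
    mask: optional (n_q, n_k) additive mask; 0 keeps a position,
          -inf blocks it.
    """
    d = K.shape[-1]
    # Scaled dot product: beta[i, j] = (q_i . k_j) / sqrt(d)
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = scores + mask  # additive Mask(i, j) term
    return scores

# Example (hypothetical shapes): a causal mask that blocks each
# position i from attending to future positions j > i.
n, d = 4, 8
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)
beta = attention_scores(Q, K, causal_mask)
```

Applying a softmax over each row of `beta` would then turn the scores into attention weights; positions masked with $-\infty$ receive zero weight after normalization.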

