Enumeration of Dot Products in Causal Self-Attention
In a causal self-attention mechanism, the query vector for a given token i, denoted q_i, computes dot products only with the key vectors of tokens up to and including its own position. This ensures that the prediction for token i does not depend on future tokens. The sequence of dot products computed is as follows:
- For token 0: q_0 · k_0
- For token 1: q_1 · k_0, q_1 · k_1
- For token 2: q_2 · k_0, q_2 · k_1, q_2 · k_2
This pattern continues for subsequent tokens: for any token i, the dot products q_i · k_j are computed for all j ≤ i.
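To make the pattern concrete, here is a minimal NumPy sketch that enumerates exactly these causal dot products. The sequence length, key dimension, and random toy vectors are illustrative assumptions, not values from the card:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_k = 4, 8                  # toy sizes: 4 tokens, 8-dim queries/keys
Q = rng.normal(size=(seq_len, d_k))  # query vectors q_0 .. q_3
K = rng.normal(size=(seq_len, d_k))  # key vectors   k_0 .. k_3

# Enumerate the causal dot products: token i attends only to positions j <= i.
for i in range(seq_len):
    scores = [float(Q[i] @ K[j]) for j in range(i + 1)]  # q_i · k_j for j <= i
    print(f"token {i}: " + ", ".join(
        f"q_{i}·k_{j} = {s:.2f}" for j, s in enumerate(scores)))
```

Token 0 prints one score, token 1 prints two, and so on, reproducing the enumeration above.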
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Causal Attention Input Structure
Enumeration of Dot Products in Causal Self-Attention
State Variables in Linear Attention (μ_i, ν_i)
In an autoregressive attention mechanism, a sequence of key vectors is generated. Given the first three key vectors k_0 = [1, 2], k_1 = [3, 4], and k_2 = [5, 6], which of the following matrices represents the complete set of keys that the query at position i = 2 is allowed to interact with?
Debugging a Causal Attention Implementation
In an autoregressive attention mechanism processing a sequence of 10 tokens (indexed 0 to 9), the matrix of key vectors used to compute the output for the token at position 3 is identical to the matrix of key vectors used for the token at position 7.
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism
Learn After
Explicit Enumeration of Causal Self-Attention Dot Products
An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted q_i) can only interact with key vectors from positions j where j ≤ i. For the 4th token in a sequence (indexed as i = 3), which of the following dot product computations would not be performed?

In a self-attention mechanism where the prediction for a token at position i can only depend on tokens from positions 0 up to and including i, what is the total number of query-key dot products computed for an entire input sequence of 5 tokens (indexed 0 to 4)?

An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted as q_i) can only interact with key vectors from positions j where j ≤ i. Match each query vector with the complete set of dot products it computes.
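The counting questions above all reduce to the same fact: the allowed query-key pairs form a lower-triangular mask, with n(n+1)/2 entries for a sequence of n tokens. A short sketch under that assumption (the variable names are illustrative):

```python
import numpy as np

seq_len = 5  # matches the 5-token question above

# Causal mask: entry (i, j) is True when query i may attend to key j, i.e. j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(mask.astype(int))  # lower-triangular pattern of allowed attention
print(int(mask.sum()))   # 15 allowed dot products = 5*6/2 = n(n+1)/2
```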