Example

Enumeration of Dot Products in Causal Self-Attention

In a causal self-attention mechanism, the query vector for a given token $i$, denoted $\mathbf{q}_i$, only computes dot products with the key vectors of tokens up to and including its own position. This ensures that the prediction for token $i$ does not depend on future tokens. The sequence of dot products computed is as follows:

  • For token 0: $\mathbf{q}_0 \mathbf{k}_0^T$
  • For token 1: $\mathbf{q}_1 \mathbf{k}_0^T$, $\mathbf{q}_1 \mathbf{k}_1^T$
  • For token 2: $\mathbf{q}_2 \mathbf{k}_0^T$, $\mathbf{q}_2 \mathbf{k}_1^T$, $\mathbf{q}_2 \mathbf{k}_2^T$

This pattern continues for subsequent tokens: for any token $i$, the dot products $\mathbf{q}_i \mathbf{k}_j^T$ are calculated for all $j \le i$, as shown in the sketch below.
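
To make the enumeration concrete, here is a minimal sketch in Python/PyTorch (the sequence length, head dimension, and variable names are illustrative assumptions, not from the original text): all pairwise scores $\mathbf{q}_i \mathbf{k}_j^T$ are computed in one matrix multiply, and the entries with $j > i$ are then masked out.

```python
import torch

# Illustrative sizes (assumptions, not from the text)
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)  # row i holds the query vector q_i
K = torch.randn(seq_len, d_k)  # row j holds the key vector k_j

# All pairwise dot products in one matrix multiply:
# entry (i, j) of `scores` is q_i k_j^T
scores = Q @ K.T  # shape: (seq_len, seq_len)

# Causal mask: token i may only attend to positions j <= i,
# so entries with j > i are set to -inf before the softmax
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)  # row i is zero wherever j > i
```

Row $i$ of `attn` has nonzero weights only at columns $j \le i$, matching the token-by-token enumeration above.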


