Example of Causal Attention Dot Products

In a causal self-attention mechanism, a query at a given position $i$ is restricted to attending only to keys at positions $j$ that are less than or equal to $i$ ($j \le i$). This ensures that the prediction for a token depends only on the preceding tokens. The following list illustrates the specific query-key dot products that are computed for a sequence, demonstrating the lower-triangular pattern of attention scores before masking:

  • For query $\mathbf{q}_0$: $\mathbf{q}_0 \mathbf{k}_0^T$
  • For query $\mathbf{q}_1$: $\mathbf{q}_1 \mathbf{k}_0^T$, $\mathbf{q}_1 \mathbf{k}_1^T$
  • For query $\mathbf{q}_2$: $\mathbf{q}_2 \mathbf{k}_0^T$, $\mathbf{q}_2 \mathbf{k}_1^T$, $\mathbf{q}_2 \mathbf{k}_2^T$
  • For query $\mathbf{q}_3$: $\mathbf{q}_3 \mathbf{k}_0^T$, $\mathbf{q}_3 \mathbf{k}_1^T$, $\mathbf{q}_3 \mathbf{k}_2^T$, $\mathbf{q}_3 \mathbf{k}_3^T$

This pattern continues for the entire sequence length.
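To make the pattern concrete, here is a minimal NumPy sketch that computes all query-key dot products and then applies a causal mask so that each query $\mathbf{q}_i$ only places weight on keys $\mathbf{k}_0, \dots, \mathbf{k}_i$. The scaling by $\sqrt{d_k}$, the function name, and the random toy inputs are illustrative assumptions, not details taken from the passage above.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Compute causally masked attention weights.

    Q, K: arrays of shape (seq_len, d_k), where row i holds q_i (or k_i).
    Entries with j > i are masked to -inf before the softmax, so query i
    attends only to keys at positions 0..i.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                 # every q_i . k_j^T dot product
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # hide positions j > i
    # Row-wise softmax; masked entries receive zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# Example with 4 positions, matching q_0 .. q_3 above (toy random vectors).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
print(np.round(causal_attention_weights(Q, K), 3))  # lower-triangular weight matrix
```

The printed matrix is lower triangular: row $i$ has nonzero weights only in columns $0$ through $i$, mirroring the list of dot products above.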
