Example

Visualization of Query-Key Dot Products in Causal Attention

In a causal attention mechanism, a query at a given position is only allowed to attend to keys at the same or preceding positions, preventing information flow from the future. This is implemented by computing dot products only between a query vector $\mathbf{q}_i$ and key vectors $\mathbf{k}_j$ where the key's index $j$ is less than or equal to the query's index $i$. For a sequence of length 7 (indexed 0 to 6), the specific query-key dot products that are calculated are as follows:

  • For token 0: $\mathbf{q}_0 \cdot \mathbf{k}_0$
  • For token 1: $\mathbf{q}_1 \cdot \mathbf{k}_0$, $\mathbf{q}_1 \cdot \mathbf{k}_1$
  • For token 2: $\mathbf{q}_2 \cdot \mathbf{k}_0$, $\mathbf{q}_2 \cdot \mathbf{k}_1$, $\mathbf{q}_2 \cdot \mathbf{k}_2$
  • For token 3: $\mathbf{q}_3 \cdot \mathbf{k}_0$, $\mathbf{q}_3 \cdot \mathbf{k}_1$, $\mathbf{q}_3 \cdot \mathbf{k}_2$, $\mathbf{q}_3 \cdot \mathbf{k}_3$
  • For token 4: $\mathbf{q}_4 \cdot \mathbf{k}_0$, $\mathbf{q}_4 \cdot \mathbf{k}_1$, $\mathbf{q}_4 \cdot \mathbf{k}_2$, $\mathbf{q}_4 \cdot \mathbf{k}_3$, $\mathbf{q}_4 \cdot \mathbf{k}_4$
  • For token 5: $\mathbf{q}_5 \cdot \mathbf{k}_0$, $\mathbf{q}_5 \cdot \mathbf{k}_1$, $\mathbf{q}_5 \cdot \mathbf{k}_2$, $\mathbf{q}_5 \cdot \mathbf{k}_3$, $\mathbf{q}_5 \cdot \mathbf{k}_4$, $\mathbf{q}_5 \cdot \mathbf{k}_5$
  • For token 6: $\mathbf{q}_6 \cdot \mathbf{k}_0$, $\mathbf{q}_6 \cdot \mathbf{k}_1$, $\mathbf{q}_6 \cdot \mathbf{k}_2$, $\mathbf{q}_6 \cdot \mathbf{k}_3$, $\mathbf{q}_6 \cdot \mathbf{k}_4$, $\mathbf{q}_6 \cdot \mathbf{k}_5$, $\mathbf{q}_6 \cdot \mathbf{k}_6$
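
The enumeration above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `causal_pairs` is hypothetical and not part of the course material. It lists, for each query index, the key indices it may be paired with, and counts the total number of dot products.

```python
def causal_pairs(seq_len):
    """For each query index i, return the key indices j with j <= i."""
    return [list(range(i + 1)) for i in range(seq_len)]


pairs = causal_pairs(7)

# Print one line per token, mirroring the list of dot products above.
for i, keys in enumerate(pairs):
    terms = ", ".join(f"dot(q{i}, k{j})" for j in keys)
    print(f"token {i}: {terms}")

# Token i contributes i + 1 dot products, so the total is 1 + 2 + ... + 7.
total = sum(len(keys) for keys in pairs)
print(total)  # 28
```

For a sequence of length n the count is n(n + 1)/2, so here 28 of the 49 possible query-key pairs are used.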

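This selective computation results in a lower triangular attention score matrix: each row holds scores only for the keys at or before that row's query position, so for 7 tokens just 28 of the 49 possible query-key pairs are scored. This structure is fundamental to autoregressive models, which predict each token from the tokens that precede it.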
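
A minimal NumPy sketch of how the lower triangular structure might be produced in practice. Several things here are assumptions not stated in the text above: the function name `causal_attention_weights` is hypothetical, the vectors are random with an arbitrary dimension of 4, and the scores are scaled by the square root of the dimension and passed through a softmax, as is common in Transformer attention. Note also that this sketch computes all 49 dot products and then masks the future positions, a frequent dense shortcut, rather than skipping those products outright.

```python
import numpy as np


def causal_attention_weights(Q, K):
    """Attention weights with a causal (lower triangular) mask.

    Q, K: arrays of shape (n, d), one query/key vector per position.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # all n x n dot products
    mask = np.tril(np.ones((n, n), dtype=bool))   # True where j <= i
    scores = np.where(mask, scores, -np.inf)      # hide future keys
    # Numerically stable softmax over each row; masked entries become 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
Q = rng.standard_normal((7, 4))   # 7 positions, assumed dimension 4
K = rng.standard_normal((7, 4))
W = causal_attention_weights(Q, K)

print(np.count_nonzero(np.triu(W, k=1)))  # 0: no weight on future keys
```

Each row of `W` sums to 1, and the first row places all of its weight on key 0, since token 0 has no earlier positions to attend to.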

Updated 2025-10-10
