1Cademy - Visualization of Query-Key Dot Products in Causal Attention

In a causal attention mechanism, the query at a given position may attend only to keys at the same or earlier positions, which prevents information from flowing in from future tokens. Concretely, a dot product between a query vector $\mathbf{q}_i$ and a key vector $\mathbf{k}_j$ is computed only when the key's index $j$ is less than or equal to the query's index $i$. For a sequence of length 7 (indexed 0 to 6), the query-key dot products that are calculated are as follows:

For token 0: $\mathbf{q}_0 \cdot \mathbf{k}_0$
For token 1: $\mathbf{q}_1 \cdot \mathbf{k}_0, \mathbf{q}_1 \cdot \mathbf{k}_1$
For token 2: $\mathbf{q}_2 \cdot \mathbf{k}_0, \mathbf{q}_2 \cdot \mathbf{k}_1, \mathbf{q}_2 \cdot \mathbf{k}_2$
For token 3: $\mathbf{q}_3 \cdot \mathbf{k}_0, \mathbf{q}_3 \cdot \mathbf{k}_1, \mathbf{q}_3 \cdot \mathbf{k}_2, \mathbf{q}_3 \cdot \mathbf{k}_3$
For token 4: $\mathbf{q}_4 \cdot \mathbf{k}_0, \mathbf{q}_4 \cdot \mathbf{k}_1, \mathbf{q}_4 \cdot \mathbf{k}_2, \mathbf{q}_4 \cdot \mathbf{k}_3, \mathbf{q}_4 \cdot \mathbf{k}_4$
For token 5: $\mathbf{q}_5 \cdot \mathbf{k}_0, \mathbf{q}_5 \cdot \mathbf{k}_1, \mathbf{q}_5 \cdot \mathbf{k}_2, \mathbf{q}_5 \cdot \mathbf{k}_3, \mathbf{q}_5 \cdot \mathbf{k}_4, \mathbf{q}_5 \cdot \mathbf{k}_5$
For token 6: $\mathbf{q}_6 \cdot \mathbf{k}_0, \mathbf{q}_6 \cdot \mathbf{k}_1, \mathbf{q}_6 \cdot \mathbf{k}_2, \mathbf{q}_6 \cdot \mathbf{k}_3, \mathbf{q}_6 \cdot \mathbf{k}_4, \mathbf{q}_6 \cdot \mathbf{k}_5, \mathbf{q}_6 \cdot \mathbf{k}_6$
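
The pattern listed above can be reproduced with a short NumPy sketch. The vectors are random placeholders and the dimension `d = 4` is an assumption made for illustration; the names `Q`, `K`, `pairs`, `scores`, and `mask` are hypothetical and do not come from the original text. The sketch enumerates the allowed index pairs with $j \le i$, computes only those dot products, and checks them against the lower triangle of the full score matrix.

```python
import numpy as np

n, d = 7, 4  # n = 7 matches the example; d = 4 is an assumed vector dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))  # placeholder query vectors q_0 .. q_6
K = rng.standard_normal((n, d))  # placeholder key vectors   k_0 .. k_6

# Allowed (i, j) pairs: key index j is at most query index i.
pairs = [(i, j) for i in range(n) for j in range(i + 1)]

# Compute only the dot products the causal rule permits.
scores = {(i, j): float(Q[i] @ K[j]) for (i, j) in pairs}

# Equivalent view: a lower-triangular boolean mask over the full n x n score matrix.
mask = np.tril(np.ones((n, n), dtype=bool))
full = Q @ K.T

# Print the same listing as in the text, one row per query token.
for i in range(n):
    terms = ", ".join(f"q{i}.k{j}" for j in range(i + 1))
    print(f"token {i}: {terms}")
```

Token $i$ contributes $i + 1$ dot products, so the seven rows together hold $1 + 2 + \dots + 7 = 28$ of the 49 possible query-key pairs. Many practical implementations compute the full matrix and mask out the upper triangle instead of skipping those products; the sketch shows both views only to make the triangular pattern explicit.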