Learn Before
Example of Causal Attention Dot Products
In a causal self-attention mechanism, a query at a given position i is restricted to attending only to keys at positions j that are less than or equal to i (j ≤ i). This ensures that the prediction for a token depends only on the preceding tokens. The following list illustrates the specific query-key dot products that are computed for a sequence, demonstrating the lower-triangular pattern of attention scores before masking:
- For query q₀: q₀·k₀
- For query q₁: q₁·k₀, q₁·k₁
- For query q₂: q₂·k₀, q₂·k₁, q₂·k₂
- For query q₃: q₃·k₀, q₃·k₁, q₃·k₂, q₃·k₃
This pattern continues for the entire sequence length.
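To make the pattern concrete, here is a minimal NumPy sketch (an illustration, not part of the original card; the sequence length of 4 and key dimension of 8 are arbitrary choices) that enumerates exactly these causal dot products:

```python
import numpy as np

# Minimal sketch: enumerate the causal query-key dot products for a
# toy sequence. seq_len and d_k are illustrative choices, not fixed
# by the card above.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8

Q = rng.normal(size=(seq_len, d_k))  # one query vector per position
K = rng.normal(size=(seq_len, d_k))  # one key vector per position

# A query at position i may only attend to keys at positions j <= i,
# which yields the lower-triangular pattern listed above.
for i in range(seq_len):
    scores = [float(Q[i] @ K[j]) for j in range(i + 1)]
    print(f"q{i}: dot products with k0..k{i} ->", np.round(scores, 2))
```

Each row of output contains one more score than the previous row, which is the lower-triangular shape the list describes.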
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
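For reference, the lower-triangular structure described in the card above can be constructed directly. The following is a minimal sketch (NumPy, with arbitrary score values; not part of the original question) of how such a causal mask is built and applied before the softmax:

```python
import numpy as np

# Sketch: a causal mask for a 4-token sequence is lower-triangular;
# position i may attend only to positions j <= i.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Before the softmax, masked positions are typically set to -inf so
# they receive zero attention weight.
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
masked = np.where(mask, scores, -np.inf)
```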
Debugging a Generative Language Model
Choosing the Right Attention Mechanism
Learn After
In a self-attention mechanism designed for generating text one token at a time, the calculation for a token at a specific position must only depend on the tokens that came before it and the token at the current position. For a sequence of 5 tokens (indexed 0 to 4), which of the following dot product calculations between a query vector (q) and a key vector (k) would be disallowed to maintain this property?
In a self-attention mechanism where the output for any given position can only depend on inputs at the current and preceding positions, consider a sequence of 8 tokens (indexed 0 to 7). The query vector for the final token in the sequence will have its dot product computed with a total of ___ key vectors.
In a language model that generates text sequentially, the attention mechanism ensures that the prediction for a token only depends on the tokens that have come before it, including itself. For a sequence of 6 tokens (indexed 0 to 5), which of the following lists represents the complete set of dot products that must be computed for the query vector at position 3 (q₃)?
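For self-checking questions like the ones above, a small helper (hypothetical, written for this note; the name causal_dot_products is an assumption) can enumerate the allowed dot products for any sequence length under the convention that position i attends to positions j ≤ i:

```python
# Sketch: enumerate the dot products a causal model computes, to
# sanity-check questions like those above. seq_len is arbitrary.
def causal_dot_products(seq_len: int) -> dict[int, list[str]]:
    """Map each query position i to the dot products q_i . k_j with j <= i."""
    return {i: [f"q{i}·k{j}" for j in range(i + 1)] for i in range(seq_len)}

# For a 6-token sequence, the query at position 3 pairs with keys 0..3:
print(causal_dot_products(6)[3])  # ['q3·k0', 'q3·k1', 'q3·k2', 'q3·k3']
```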