Learn Before
Debugging a Causal Attention Implementation
An engineer is implementing an autoregressive attention mechanism for a sequence of 10 tokens (indexed 0 to 9). When computing the attention output for the token at position i=4, they construct a key matrix containing the key vectors from all 10 positions (k_0 through k_9). Explain why this is incorrect for an autoregressive (causal) mechanism. What is the correct composition of the key matrix for position i=4?
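For intuition, here is a minimal NumPy-style sketch of the fix (the 10x2 key shape, the random values, and the causal_keys helper are illustrative assumptions, not part of the exercise): under causal masking, the query at position i may only interact with keys k_0 through k_i, so the key matrix for i=4 should contain exactly 5 rows, not 10.

```python
import numpy as np

def causal_keys(K, i):
    # Keys a causal (autoregressive) query at position i may attend to:
    # rows k_0 through k_i only -- nothing from future positions.
    return K[: i + 1]

# Illustrative setup: 10 tokens with 2-dimensional keys (shape is an assumption).
rng = np.random.default_rng(0)
K = rng.standard_normal((10, 2))   # rows are k_0 ... k_9

K_buggy = K                    # shape (10, 2): leaks future keys k_5 ... k_9
K_correct = causal_keys(K, 4)  # shape (5, 2): k_0 ... k_4, as causality requires
print(K_buggy.shape, K_correct.shape)  # (10, 2) (5, 2)
```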
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Causal Attention Input Structure
Enumeration of Dot Products in Causal Self-Attention
State Variables in Linear Attention (μ_i, ν_i)
In an autoregressive attention mechanism, a sequence of key vectors is generated. Given the first three key vectors k_0 = [1, 2], k_1 = [3, 4], and k_2 = [5, 6], which of the following matrices represents the complete set of keys that the query at position i=2 is allowed to interact with?
Debugging a Causal Attention Implementation
In an autoregressive attention mechanism processing a sequence of 10 tokens (indexed 0 to 9), the matrix of key vectors used to compute the output for the token at position 3 is identical to the matrix of key vectors used for the token at position 7.