Short Answer

Debugging a Causal Attention Implementation

An engineer is implementing an autoregressive (causal) attention mechanism for a sequence of 10 tokens (indexed 0 to 9). When computing attention for the token at position i=4, they construct a key matrix that includes the key vectors from all 10 positions (k_0 through k_9). Explain why this is incorrect for an autoregressive mechanism, and state the correct composition of the key matrix for position i=4.
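A minimal NumPy sketch of the intended behavior may help when reasoning about the answer (illustrative only; the function name `causal_attention_at` and the `queries`/`keys`/`values` arrays are assumptions, not the engineer's actual code). In a causal setup, the key and value matrices for step i are restricted to positions 0 through i, so the token at position 4 attends over exactly k_0 through k_4 and never sees the future keys k_5 through k_9.

```python
import numpy as np

def causal_attention_at(i, queries, keys, values):
    """Attention output for position i using only keys/values at positions <= i.

    queries, keys: shape (seq_len, d_k); values: shape (seq_len, d_v).
    """
    d_k = keys.shape[-1]
    # Correct composition: only k_0 ... k_i (for i=4, that is 5 key vectors).
    # Including k_{i+1} ... k_{seq_len-1} would leak future-token information.
    K_i = keys[: i + 1]                           # shape (i+1, d_k)
    V_i = values[: i + 1]                         # shape (i+1, d_v)
    scores = queries[i] @ K_i.T / np.sqrt(d_k)    # shape (i+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over positions 0..i
    return weights @ V_i                          # weighted sum of past values

# Example: sequence of 10 tokens, attention for the token at position i=4
rng = np.random.default_rng(0)
q = rng.normal(size=(10, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 8))
out = causal_attention_at(4, q, k, v)  # uses exactly k_0 ... k_4
```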



Tags: Ch.2 Generative Models - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Analysis in Bloom's Taxonomy; Cognitive Psychology; Psychology; Social Science; Empirical Science; Science