Learn Before
Computational Cost per Token in Causal Attention
In autoregressive generation, the computational cost for the attention mechanism at a single step is linear in the current sequence length m, expressed as O(m). This cost is primarily driven by two matrix-vector operations: the dot products between the current query vector and all previous key vectors (i.e., qₘKᵀ), and the subsequent weighted summation of the previous value vectors, which involves multiplying the Softmax output with the value matrix V.
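To make the two O(m) operations concrete, here is a minimal NumPy sketch of a single decoding step with cached keys and values. The names (K_cache, V_cache) and the dimensions are illustrative assumptions, not from the source.

```python
import numpy as np

d = 64                      # head dimension (assumed for the sketch)
m = 10                      # current sequence length
rng = np.random.default_rng(0)

K_cache = rng.standard_normal((m, d))   # keys for tokens 0..m-1
V_cache = rng.standard_normal((m, d))   # values for tokens 0..m-1
q = rng.standard_normal(d)              # query for the current step

# Operation 1: query-key dot products, one per previous token -> O(m*d)
scores = K_cache @ q / np.sqrt(d)       # shape (m,)

# Softmax over the m scores -> O(m)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Operation 2: weighted sum of the m value vectors -> O(m*d)
output = weights @ V_cache              # shape (d,)
print(output.shape)                     # (64,)
```

Both matrix-vector products touch every one of the m cached rows exactly once, which is where the linear dependence on the sequence length comes from.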

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Cost per Token in Causal Attention
Reusability of Key-Value Pairs in Autoregressive Inference
Example of Query-Key Interactions in Causal Attention
An autoregressive model is generating a sequence of tokens one by one. It is currently calculating the attention output for the token at position 4 (i.e., the fifth token in the sequence). To ensure the model only uses information it has already seen, which set of key (K) and value (V) vectors must be used as input to the attention mechanism for the query vector at position 4 (q₄)?
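One way to visualize the constraint in this question: with 0-indexed positions, the query at position 4 can only be paired with cached keys and values up to and including that position. The buffer names and sizes in this sketch are assumptions for illustration.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
K_cache = rng.standard_normal((16, d))  # keys for a longer buffer (illustrative)
V_cache = rng.standard_normal((16, d))

pos = 4
K_visible = K_cache[: pos + 1]          # positions 0..4 -> 5 key vectors
V_visible = V_cache[: pos + 1]          # positions 0..4 -> 5 value vectors
print(K_visible.shape, V_visible.shape) # (5, 8) (5, 8)
```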
Diagnosing Information Leakage in an Autoregressive Model
When calculating the attention output for a specific token at position i in an autoregressive model, the mechanism is structured to use the query vector from that same position (q_i), while the key and value matrices are composed of the corresponding vectors from all positions in the full input sequence.
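The standard remedy for this kind of leakage is a causal (lower-triangular) mask that zeroes out attention to positions after i. Below is a hedged NumPy sketch of that fix; the function name and shapes are illustrative, not from the source.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Full-sequence attention with a causal (lower-triangular) mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) raw scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)         # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = causal_attention(Q, K, V)
# Row i of the weights depends only on columns 0..i, so no future token leaks in.
```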
Learn After
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N?
Analyzing Generation Latency
Predicting Attention Computation Time
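The latency and computation-time questions above can be grounded with a quick measurement. This sketch times one attention step at several sequence lengths; the absolute numbers are machine-dependent and the loop count is an arbitrary choice, but the linear trend is the point.

```python
import time
import numpy as np

d = 128
rng = np.random.default_rng(3)

for N in (1_000, 2_000, 4_000, 8_000):
    K = rng.standard_normal((N, d))
    V = rng.standard_normal((N, d))
    q = rng.standard_normal(d)
    t0 = time.perf_counter()
    for _ in range(100):                        # repeat to stabilize the timing
        s = K @ q / np.sqrt(d)                  # O(N*d) dot products
        w = np.exp(s - s.max()); w /= w.sum()   # O(N) softmax
        _ = w @ V                               # O(N*d) weighted sum
    print(N, round(time.perf_counter() - t0, 4))
# Doubling N should roughly double the per-step time, matching the O(N) claim.
```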