Computational Cost per Token in Causal Attention

In autoregressive generation, the computational cost of the attention mechanism at a single step $i'$ is linear in the current sequence length, i.e., $O(i')$. This cost is driven primarily by two matrix-vector operations: the dot products between the current query vector $\mathbf{q}_{i'}$ and all previous key vectors (i.e., $\mathbf{q}_{i'}\mathbf{K}_{\le i'}^{\mathrm{T}}$), and the subsequent weighted summation of the previous value vectors, which multiplies the Softmax output with the value matrix $\mathbf{V}_{\le i'}$.
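To make the per-step cost concrete, here is a minimal NumPy sketch of one decoding step (the function name `attention_step` and the array shapes are illustrative assumptions, not from the source). Both matrix-vector products touch every cached position exactly once, which is where the $O(i')$ cost comes from.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """One decoding step of causal attention (illustrative sketch).

    q:        (d,)    query vector for the current position i'
    K_cache:  (i', d) key vectors for all positions <= i'
    V_cache:  (i', d) value vectors for all positions <= i'

    Both matrix-vector products below read every cached row once,
    so the cost of this single step is O(i').
    """
    d = q.shape[0]
    # Dot products q_{i'} K_{<=i'}^T: one multiply-add per cached key -> O(i').
    scores = K_cache @ q / np.sqrt(d)        # shape (i',)
    # Numerically stable Softmax over the i' attention scores.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of cached values (Softmax output times V_{<=i'}): again O(i').
    return weights @ V_cache                 # shape (d,)
```

Summing this per-step cost over a generation of length $n$ gives the familiar $O(n^2)$ total cost of causal attention.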

