Learn Before
Computational Cost per Token in Causal Attention
In autoregressive generation, the computational cost for the attention mechanism at a single step is linear in the current sequence length m, expressed as O(m). This cost is primarily driven by two matrix-vector operations: the dot products between the current query vector and all previous key vectors (i.e., qₘKᵀ), and the subsequent weighted summation of the previous value vectors, which involves multiplying the Softmax output with the value matrix V.
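To make the two O(m) operations concrete, here is a minimal NumPy sketch of a single decoding step with cached keys and values. The names (K_cache, V_cache) and the dimensions are illustrative assumptions, not from the source.

```python
import numpy as np

d = 64                      # head dimension (assumed for the sketch)
m = 10                      # current sequence length
rng = np.random.default_rng(0)

K_cache = rng.standard_normal((m, d))   # keys for tokens 0..m-1
V_cache = rng.standard_normal((m, d))   # values for tokens 0..m-1
q = rng.standard_normal(d)              # query for the current step

# Operation 1: query-key dot products, one per previous token -> O(m*d)
scores = K_cache @ q / np.sqrt(d)       # shape (m,)

# Softmax over the m scores -> O(m)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Operation 2: weighted sum of the m value vectors -> O(m*d)
output = weights @ V_cache              # shape (d,)
print(output.shape)                     # (64,)
```

Both matrix-vector products touch every one of the m cached rows exactly once, which is where the linear dependence on the sequence length comes from.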

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Cost per Token in Causal Attention
Reusability of Key-Value Pairs in Autoregressive Inference
Example of Query-Key Interactions in Causal Attention
An autoregressive model is generating a sequence of tokens one by one. It is currently calculating the attention output for the token at position 4 (i.e., the fifth token in the sequence). To ensure the model only uses information it has already seen, which set of key (K) and value (V) vectors must be used as input to the attention mechanism for the query vector at position 4 (q₄)?
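One way to visualize the constraint in this question: with 0-indexed positions, the query at position 4 can only be paired with cached keys and values up to and including that position. The buffer names and sizes in this sketch are assumptions for illustration.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
K_cache = rng.standard_normal((16, d))  # keys for a longer buffer (illustrative)
V_cache = rng.standard_normal((16, d))

pos = 4
K_visible = K_cache[: pos + 1]          # positions 0..4 -> 5 key vectors
V_visible = V_cache[: pos + 1]          # positions 0..4 -> 5 value vectors
print(K_visible.shape, V_visible.shape) # (5, 8) (5, 8)
```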
Diagnosing Information Leakage in an Autoregressive Model
When calculating the attention output for a specific token at position i in an autoregressive model, the mechanism is structured to use the query vector from that same position (q_i), while the key and value matrices are composed of the corresponding vectors from all positions in the full input sequence.
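The standard remedy for this kind of leakage is a causal (lower-triangular) mask that zeroes out attention to positions after i. Below is a hedged NumPy sketch of that fix; the function name and shapes are illustrative, not from the source.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Full-sequence attention with a causal (lower-triangular) mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) raw scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)         # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = causal_attention(Q, K, V)
# Row i of the weights depends only on columns 0..i, so no future token leaks in.
```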
Learn After
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N?
Analyzing Generation Latency
Predicting Attention Computation Time
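The latency and computation-time questions above can be grounded with a quick measurement. This sketch times one attention step at several sequence lengths; the absolute numbers are machine-dependent and the loop count is an arbitrary choice, but the linear trend is the point.

```python
import time
import numpy as np

d = 128
rng = np.random.default_rng(3)

for N in (1_000, 2_000, 4_000, 8_000):
    K = rng.standard_normal((N, d))
    V = rng.standard_normal((N, d))
    q = rng.standard_normal(d)
    t0 = time.perf_counter()
    for _ in range(100):                        # repeat to stabilize the timing
        s = K @ q / np.sqrt(d)                  # O(N*d) dot products
        w = np.exp(s - s.max()); w /= w.sum()   # O(N) softmax
        _ = w @ V                               # O(N*d) weighted sum
    print(N, round(time.perf_counter() - t0, 4))
# Doubling N should roughly double the per-step time, matching the O(N) claim.
```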