Learn Before
Analyzing Generation Latency
Focusing on the causal attention mechanism, explain the primary reason for the observed linear growth in processing time for each new token.
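Not part of the original card, but a minimal sketch of the reasoning the question targets (the FLOP model and head dimension d = 64 are illustrative assumptions): with a KV cache, generating the token at position N+1 still requires the new query to form attention scores against all N cached keys and then take a weighted sum over all N cached values, so both dominant operations grow linearly with N.

```python
# Illustrative per-step FLOP model for causal attention during decoding.
# The head dimension and function name below are assumptions for this sketch.
D = 64  # per-head dimension (illustrative)

def attention_step_flops(n: int, d: int = D) -> int:
    """Approximate FLOPs for the newest token's attention at position n + 1.

    Two operations dominate, and each touches every cached position:
      1. score computation: query (1 x d) dotted with all n + 1 keys  -> ~2*d*(n+1)
      2. weighted value sum: weights (1 x (n+1)) times values (n+1 x d) -> ~2*d*(n+1)
    """
    return 2 * d * (n + 1) + 2 * d * (n + 1)

# Doubling the cached length roughly doubles the per-step cost: linear in n.
c_1k = attention_step_flops(1000)
c_2k = attention_step_flops(2000)
```

The softmax and the per-token feed-forward layers cost either O(N) or O(1) extra, so the two matrix products above are what drive the observed linear growth in latency per generated token.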
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N?
Analyzing Generation Latency
Predicting Attention Computation Time