Concept

Self-Attention Sequence Processing Complexity

In a self-attention mechanism processing a sequence of length nn, the query, key, and value matrices each have dimensions nimesdn imes d. The scaled dot-product attention computes the product of an nimesdn imes d matrix with a dimesnd imes n matrix, and then multiplies the resulting nimesnn imes n matrix by an nimesdn imes d matrix. This yields a total computational complexity of O(n2d)\mathcal{O}(n^2d). Since every token is directly connected to every other token, the computation requires only O(1)\mathcal{O}(1) sequential operations (enabling full parallelization), and the maximum path length is the shortest possible at O(1)\mathcal{O}(1).

0

1

Updated 2026-05-15

Contributors are:

Who are from:

Tags

D2L

Dive into Deep Learning @ D2L