Concept
Self-Attention Sequence Processing Complexity
In a self-attention mechanism processing a sequence of length $n$, the query, key, and value matrices each have dimensions $n \times d$. The scaled dot-product attention computes the product of an $n \times d$ matrix with a $d \times n$ matrix, and then multiplies the resulting $n \times n$ matrix by an $n \times d$ matrix. This yields a total computational complexity of $\mathcal{O}(n^2 d)$. Since every token is directly connected to every other token, the computation requires only $\mathcal{O}(1)$ sequential operations (enabling full parallelization), and the maximum path length is the shortest possible at $\mathcal{O}(1)$.
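The two matrix products described above can be sketched in NumPy to make the quadratic cost concrete. This is a minimal illustration, not a reference implementation: it assumes the queries, keys, and values are all the raw input `X` (no learned projections), and the names `self_attention` and `attention_multiply_adds` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (assumption: no projections)."""
    n, d = X.shape
    # (n, d) @ (d, n) -> (n, n): roughly n * n * d multiply-adds
    scores = X @ X.T / np.sqrt(d)
    # Subtract the row-wise max for a numerically stable softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # (n, n) @ (n, d) -> (n, d): roughly n * n * d multiply-adds
    return weights @ X


def attention_multiply_adds(n, d):
    """Approximate multiply-add count of the two matrix products (softmax cost ignored)."""
    return 2 * n * n * d


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # n = 6 tokens, d = 4 features
out = self_attention(X)
print(out.shape)
print(attention_multiply_adds(6, 4), attention_multiply_adds(12, 4))
```

Doubling the sequence length from 6 to 12 while holding $d$ fixed quadruples the multiply-add count, which is the $n^2$ factor in the stated complexity. Both products are single matrix multiplications with no dependence between rows, which is why the number of sequential operations does not grow with $n$.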
Updated 2026-05-15
Tags
D2L
Dive into Deep Learning @ D2L