In a self-attention mechanism processing a sequence of length $$n$$, the query, key, and value matrices each have dimensions $$n 	imes d$$. The scaled dot-product attention computes the product of an $$n 	imes d$$ matrix with a $$d 	imes n$$ matrix, and then multiplies the resulting $$n 	imes n$$ matrix by an $$n 	imes d$$ matrix. This yields a total computational complexity of $$\mathcal{O}(n^2d)$$. Since every token is directly connected to every other token, the computation requires only $$\mathcal{O}(1)$$ sequential operations (enabling full parallelization), and the maximum path length is the shortest possible at $$\mathcal{O}(1)$$.

Claude

When evaluating architectures such as CNNs, RNNs, and self-attention for mapping an input sequence of $$n$$ tokens to an output sequence of the same length (with each token represented as a $$d$$-dimensional vector), three main properties are compared: computational complexity, sequential operations, and maximum path lengths. A smaller number of sequential operations is desirable as it allows for parallel computation, while a shorter maximum path length between tokens makes it easier for the network to learn long-range dependencies.

Comparing CNN, RNN, and Self-Attention Architectures

Dive into Deep Learning

In a recurrent neural network (RNN) processing a sequence of length $$n$$, updating the $$d$$-dimensional hidden state involves a multiplication with a $$d 	imes d$$ weight matrix, leading to a computational cost of $$\mathcal{O}(d^2)$$ per step. Consequently, the total computational complexity for the entire recurrent layer is $$\mathcal{O}(nd^2)$$. Because the hidden state must be updated iteratively one token at a time, there are $$\mathcal{O}(n)$$ sequential operations that cannot be parallelized, and the maximum path length between any two tokens is similarly $$\mathcal{O}(n)$$.

RNN Sequence Processing Complexity

Self-Attention Sequence Processing Complexity

When comparing CNNs, RNNs, and self-attention for sequence tasks, each architecture presents distinct trade-offs. Both CNNs and self-attention support highly parallel computation due to their $$\mathcal{O}(1)$$ sequential operations, unlike RNNs. Additionally, self-attention provides the absolute shortest maximum path length of $$\mathcal{O}(1)$$, making it optimal for capturing long-range dependencies. However, because its computational complexity scales quadratically with the sequence length, $$\mathcal{O}(n^2d)$$, self-attention becomes prohibitively slow for very long sequences.

Trade-offs in Sequence Architecture Selection

When processing a sequence of length $$n$$, a one-dimensional Convolutional Neural Network (CNN) treats the text as a 'one-dimensional image', extracting local features such as $$n$$-grams. For a convolutional layer with a kernel size of $$k$$ and $$d$$ input and output channels, the computational complexity is $$\mathcal{O}(knd^2)$$. Because CNNs process local regions simultaneously and use a hierarchical structure, they require only $$\mathcal{O}(1)$$ sequential operations, enabling parallel computation. Furthermore, the hierarchical architecture allows the receptive field to expand; for instance, a two-layer CNN with $$k = 3$$ connects distant tokens like $$\mathbf{x}_1$$ and $$\mathbf{x}_5$$, resulting in a maximum path length of $$\mathcal{O}(n/k)$$.

Learn Before

Related