In a recurrent neural network (RNN) processing a sequence of length $$n$$, updating the $$d$$-dimensional hidden state involves a multiplication with a $$d 	imes d$$ weight matrix, leading to a computational cost of $$\mathcal{O}(d^2)$$ per step. Consequently, the total computational complexity for the entire recurrent layer is $$\mathcal{O}(nd^2)$$. Because the hidden state must be updated iteratively one token at a time, there are $$\mathcal{O}(n)$$ sequential operations that cannot be parallelized, and the maximum path length between any two tokens is similarly $$\mathcal{O}(n)$$.

RNN Sequence Processing Complexity

In a self-attention mechanism processing a sequence of length $$n$$, the query, key, and value matrices each have dimensions $$n 	imes d$$. The scaled dot-product attention computes the product of an $$n 	imes d$$ matrix with a $$d 	imes n$$ matrix, and then multiplies the resulting $$n 	imes n$$ matrix by an $$n 	imes d$$ matrix. This yields a total computational complexity of $$\mathcal{O}(n^2d)$$. Since every token is directly connected to every other token, the computation requires only $$\mathcal{O}(1)$$ sequential operations (enabling full parallelization), and the maximum path length is the shortest possible at $$\mathcal{O}(1)$$.

Self-Attention Sequence Processing Complexity

When comparing CNNs, RNNs, and self-attention for sequence tasks, each architecture presents distinct trade-offs. Both CNNs and self-attention support highly parallel computation due to their $$\mathcal{O}(1)$$ sequential operations, unlike RNNs. Additionally, self-attention provides the absolute shortest maximum path length of $$\mathcal{O}(1)$$, making it optimal for capturing long-range dependencies. However, because its computational complexity scales quadratically with the sequence length, $$\mathcal{O}(n^2d)$$, self-attention becomes prohibitively slow for very long sequences.

Trade-offs in Sequence Architecture Selection

When processing a sequence of length $$n$$, a one-dimensional Convolutional Neural Network (CNN) treats the text as a 'one-dimensional image', extracting local features such as $$n$$-grams. For a convolutional layer with a kernel size of $$k$$ and $$d$$ input and output channels, the computational complexity is $$\mathcal{O}(knd^2)$$. Because CNNs process local regions simultaneously and use a hierarchical structure, they require only $$\mathcal{O}(1)$$ sequential operations, enabling parallel computation. Furthermore, the hierarchical architecture allows the receptive field to expand; for instance, a two-layer CNN with $$k = 3$$ connects distant tokens like $$\mathbf{x}_1$$ and $$\mathbf{x}_5$$, resulting in a maximum path length of $$\mathcal{O}(n/k)$$.

CNN Sequence Processing Complexity

When evaluating architectures such as CNNs, RNNs, and self-attention for mapping an input sequence of $$n$$ tokens to an output sequence of the same length (with each token represented as a $$d$$-dimensional vector), three main properties are compared: computational complexity, sequential operations, and maximum path lengths. A smaller number of sequential operations is desirable as it allows for parallel computation, while a shorter maximum path length between tokens makes it easier for the network to learn long-range dependencies.

Claude

In the encoder, the data will first go through a module called ‘self-attention’ to get a weighted feature x.

Self-Attention of Transformer

Dive into Deep Learning

In self-attention mechanisms, masks dictate which tokens within a sequence are allowed to interact with one another. This can be conceptualized by distinguishing between valid attention, where information is permitted to flow between tokens, and blocked attention, where the interaction is explicitly suppressed. For example, when processing a sequence of tokens from $$x_0$$ to $$x_4$$, a specific mask might allow a token like $$x_1$$ to receive valid attention from $$x_0$$, $$x_2$$, and $$x_4$$, while assigning blocked attention to $$x_3$$. By using these masks, models can selectively control the contextual information available to each token.

Learn Before

Related

Learn After