Positionwise Nature of Transformer Feed-Forward Networks
In a Transformer architecture, the feed-forward network is called positionwise because it applies the same Multi-Layer Perceptron (MLP), with the same weights, to the representation at every sequence position independently. For an input tensor of shape (batch size, number of time steps, number of hidden units), this two-layer MLP processes each time step's vector in isolation. Consequently, only the innermost dimension is transformed, and the output tensor has shape (batch size, number of time steps, number of output units). Because the same MLP transforms all positions, identical input vectors at different positions produce identical outputs.
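The shape behavior described above can be checked with a small NumPy sketch. This is only an illustration under stated assumptions, not a reference implementation: the function name positionwise_ffn, the ReLU activation, the random weights, and all layer sizes are hypothetical choices made for the example.

```python
import numpy as np

def positionwise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every position of X.

    X has shape (batch_size, num_steps, num_hiddens). The matrix
    products act on the last axis only, so positions never mix.
    """
    H = np.maximum(X @ W1 + b1, 0.0)  # first linear layer + ReLU (assumed activation)
    return H @ W2 + b2                # second linear layer

rng = np.random.default_rng(0)
batch_size, num_steps, num_hiddens = 2, 3, 4
ffn_num_hiddens, ffn_num_outputs = 8, 5  # hypothetical layer sizes

# One set of weights, shared by all positions.
W1 = rng.normal(size=(num_hiddens, ffn_num_hiddens))
b1 = rng.normal(size=ffn_num_hiddens)
W2 = rng.normal(size=(ffn_num_hiddens, ffn_num_outputs))
b2 = rng.normal(size=ffn_num_outputs)

# Every position holds the same vector, so every output row should match.
X = np.ones((batch_size, num_steps, num_hiddens))
Y = positionwise_ffn(X, W1, b1, W2, b2)
print(Y.shape)  # only the last dimension changes size

# Overwrite a single position; the outputs at all other positions
# should be unaffected because each position is processed in isolation.
X2 = X.copy()
X2[0, 1] = rng.normal(size=num_hiddens)
Y2 = positionwise_ffn(X2, W1, b1, W2, b2)
print(np.allclose(Y2[0, 0], Y[0, 0]), np.allclose(Y2[1], Y[1]))
```

Because the weights are shared and applied along the last axis, the same result could be obtained by looping over positions and calling the MLP on each vector separately; the batched matrix product is simply the vectorized form of that loop.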
Tags
D2L
Dive into Deep Learning @ D2L
Related
Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
FFN Hidden Size in Transformers
MLP of the Vision Transformer Encoder