MLP of the Vision Transformer Encoder
The multilayer perceptron (MLP) in the vision Transformer encoder differs slightly from the positionwise feed-forward network (FFN) of the original Transformer encoder in two ways. First, it uses the Gaussian error linear unit (GELU) as its activation function, which can be viewed as a smoother version of ReLU. Second, dropout is applied to the output of each fully connected layer in the MLP for regularization.
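The description above can be illustrated with a minimal sketch in plain Python (standard library only, no deep learning framework). It is not a reference implementation: the helper names (`gelu`, `linear`, `dropout`, `vit_mlp`), the toy weights, and the exact ordering (linear, GELU, dropout, linear, dropout) are assumptions chosen for illustration, and the dropout shown is the common "inverted" variant that rescales surviving values.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU, shown only for comparison with GELU."""
    return max(0.0, x)


def linear(x, weight, bias):
    """Fully connected layer; weight has shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p (0 <= p < 1)
    and rescale the survivors by 1 / (1 - p). Identity when not training."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else xi / (1.0 - p) for xi in x]


def vit_mlp(x, w1, b1, w2, b2, p=0.1, training=False, rng=None):
    """Assumed layout: linear -> GELU -> dropout -> linear -> dropout.
    Applied to a single token vector x (the same weights serve every position)."""
    rng = rng or random.Random(0)
    hidden = dropout([gelu(v) for v in linear(x, w1, b1)], p, training, rng)
    return dropout(linear(hidden, w2, b2), p, training, rng)


# Toy example: 2 input features, 4 hidden units, 2 output features.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
B2 = [0.0, 0.0]

if __name__ == "__main__":
    # GELU is smooth and slightly negative for small negative inputs,
    # whereas ReLU is exactly zero there.
    for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(v, relu(v), gelu(v))
    print(vit_mlp([1.0, -1.0], W1, B1, W2, B2, training=False))
```

With `training=False` both dropout calls are the identity, so the output is deterministic; with `training=True` each output element is either zeroed or scaled up, which is the regularization effect the description refers to.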
Tags
D2L
Dive into Deep Learning @ D2L
Related
Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
FFN Hidden Size in Transformers
Positionwise Nature of Transformer Feed-Forward Networks