Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
The standard Feed-Forward Network (FFN) used in Transformer sub-layers is a two-layer network. Its operation can be expressed by the formula: FFN(x) = σ(xW_1 + b_1)W_2 + b_2. In this equation, x is the input vector of dimension d_model. The components are defined with the following dimensions:
- First Layer: A linear transformation with a weight matrix W_1 ∈ ℝ^{d_model × d_ffn} and a bias vector b_1 ∈ ℝ^{d_ffn}, followed by a non-linear activation function σ.
- Second Layer: A linear transformation with a weight matrix W_2 ∈ ℝ^{d_ffn × d_model} and a bias vector b_2 ∈ ℝ^{d_model}.
The dimension of the hidden layer, d_ffn, is typically larger than the input/output dimension d_model. A common choice for the activation function σ in the hidden layer is the Rectified Linear Unit (ReLU).
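The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the dimensions (512 and 2048, the common 4× expansion) and the weight initialization are illustrative assumptions, not values from the source.

```python
import numpy as np

# Illustrative dimensions: d_ffn = 4 * d_model is a common convention.
d_model, d_ffn = 512, 2048

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ffn)) * 0.02  # first-layer weights (d_model, d_ffn)
b1 = np.zeros(d_ffn)                               # first-layer bias   (d_ffn,)
W2 = rng.standard_normal((d_ffn, d_model)) * 0.02  # second-layer weights (d_ffn, d_model)
b2 = np.zeros(d_model)                             # second-layer bias  (d_model,)

def ffn(x):
    """Position-wise FFN: sigma(x W1 + b1) W2 + b2, with sigma = ReLU."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU activation, shape (d_ffn,)
    return hidden @ W2 + b2                # project back to (d_model,)

x = rng.standard_normal(d_model)  # one position's input vector
y = ffn(x)
print(y.shape)  # (512,)
```

Note how the output dimension matches the input dimension, which is what lets the FFN sub-layer be stacked with residual connections.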

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
FFN Hidden Size (d_ffn) in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
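The first related question above has a concrete answer that can be checked numerically: without a non-linearity between them, two stacked linear transformations collapse into a single linear transformation, so the extra layer adds parameters but no expressive power. A minimal NumPy check (dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ffn = 8, 32  # small illustrative sizes

W1 = rng.standard_normal((d_model, d_ffn))
b1 = rng.standard_normal(d_ffn)
W2 = rng.standard_normal((d_ffn, d_model))
b2 = rng.standard_normal(d_model)

x = rng.standard_normal(d_model)

# Two linear layers with no activation in between...
two_layer = (x @ W1 + b1) @ W2 + b2

# ...equal a single linear layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged

print(np.allclose(two_layer, one_layer))  # True
```

The intermediate activation (e.g. ReLU) is precisely what prevents this collapse and lets the FFN model non-linear functions of its input.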