FFN Hidden Size (d_ffn) in Transformers
In Transformer models, the hidden layer of the Feed-Forward Network (FFN) sub-layer has a specific size, denoted as d_ffn. This dimension is typically larger than the model's overall hidden size, d. A common practice is to set d_ffn = 4d. For larger-scale Transformers, the value of d_ffn can be substantially increased to enhance model capacity.
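A minimal NumPy sketch of a position-wise FFN illustrating the dimensions: the first linear layer expands each position from d to d_ffn = 4d, a ReLU is applied, and the second layer projects back to d. The sizes d_model = 512 and the weight initialization are illustrative assumptions, not values from the note.

```python
import numpy as np

d_model = 512          # model hidden size d (illustrative choice)
d_ffn = 4 * d_model    # common practice: d_ffn = 4 * d

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ffn)) * 0.02   # expand: d -> d_ffn
b1 = np.zeros(d_ffn)
W2 = rng.standard_normal((d_ffn, d_model)) * 0.02   # project: d_ffn -> d
b2 = np.zeros(d_model)

def ffn(x):
    # Position-wise FFN: FFN(x) = ReLU(x W1 + b1) W2 + b2,
    # applied independently to each position in the sequence.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # a sequence of 10 positions
y = ffn(x)
print(y.shape)  # output keeps the model hidden size: (10, 512)
```

Note that the hidden activation inside `ffn` has shape (10, 2048), four times wider than the input, which is where most of a Transformer layer's parameters live.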
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
FFN Hidden Size (d_ffn) in Transformers
Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
Hidden Size in Transformer Models
FFN Hidden Size (d_ffn) in Transformers
A machine learning engineer is designing a Transformer encoder for a complex language task. Their primary goal is to improve the model's ability to capture diverse and varied contextual relationships (e.g., syntactic, semantic, co-reference) from different parts of the input sequence simultaneously. Which hyperparameter adjustment would most directly address this specific goal?
Hyperparameter Tuning Trade-offs
An engineer is configuring a Transformer encoder. Match each key hyperparameter to its specific architectural role.