FFN Hidden Size in Transformers
The Feed-Forward Network (FFN) sub-layers within Transformer models have a hidden layer whose size is denoted $d_{\text{ffn}}$. This dimension is typically designed to be larger than the model's hidden size $d$. A common architectural setup is $d_{\text{ffn}} = 4d$. In more recent, larger-scale Transformers, $d_{\text{ffn}}$ can be set to an even larger value to increase capacity.
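For illustration, here is a minimal sketch of such a position-wise FFN, assuming a PyTorch-style implementation; the class name FeedForward and the arguments d_model and d_ffn are illustrative, not taken from the course material.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand d_model -> d_ffn, apply a non-linearity, project back."""
    def __init__(self, d_model: int, d_ffn: int = None):
        super().__init__()
        # Common default assumed here: d_ffn = 4 * d_model
        d_ffn = d_ffn if d_ffn is not None else 4 * d_model
        self.w1 = nn.Linear(d_model, d_ffn)   # expansion: d_model -> d_ffn
        self.act = nn.ReLU()                  # non-linearity between the two linear maps
        self.w2 = nn.Linear(d_ffn, d_model)   # projection back: d_ffn -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model] -> [batch, seq_len, d_model]
        return self.w2(self.act(self.w1(x)))

# Example: d_model = 512 with the common 4x expansion gives d_ffn = 2048.
ffn = FeedForward(d_model=512)
out = ffn(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```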
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
FFN Hidden Size in Transformers
Hidden Size in Transformer Models
A machine learning engineer is designing a Transformer encoder for a complex language task. Their primary goal is to improve the model's ability to capture diverse and varied contextual relationships (e.g., syntactic, semantic, co-reference) from different parts of the input sequence simultaneously. Which hyperparameter adjustment would most directly address this specific goal?
Hyperparameter Tuning Trade-offs
An engineer is configuring a Transformer encoder. Match each key hyperparameter to its specific architectural role.
FFN Hidden Size in Transformers
Vocabulary Size in Transformers
Model Depth in Transformers
Number of Attention Heads
Embedding Size in Transformer Models
Evaluating Language Model Design Choices
A research team is tasked with building a language model to analyze a large collection of specialized legal contracts. These documents contain a unique vocabulary and sentence structure not commonly found in general web text. When deciding how to approach this task, which of the following considerations is the most critical to address first to ensure the model's effectiveness?
Trade-offs in Language Model Vocabulary Design
Hidden Size in Transformer Models
Number of Attention Heads
FFN Hidden Size in Transformers
Model Depth in Transformers
Vocabulary Size in Transformers
Learn After
A team of engineers is designing a large neural network for a complex language task. Within each block of their model, they use a sub-network composed of two linear transformations with a non-linearity in between. They are debating whether to make the dimensionality of the intermediate layer in this sub-network significantly larger (e.g., four times larger) than the model's primary embedding and hidden state dimension. What is the primary trade-off they must consider when making this decision?
Optimizing Transformer Model Size
Calculating Parameter Impact of FFN Expansion