Hidden Size in Transformer Models
In Transformer architectures, the hidden size, denoted d_model, specifies the dimensionality of the input and output vectors of each sub-layer. Furthermore, the majority of the internal hidden states generated within these sub-layers are also d_model-dimensional vectors. Because it determines the size of these internal representations, d_model can generally be interpreted as a measure of the overall width of the network.
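The role of d_model is easy to see in code. Below is a minimal PyTorch sketch; the specific values (d_model=512, 8 heads, FFN size 2048) are illustrative assumptions, not values prescribed by the text above.

```python
import torch
import torch.nn as nn

d_model = 512  # hidden size: dimensionality of sub-layer inputs and outputs

# A single encoder sub-layer stack; d_model fixes the width throughout.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,       # width of every residual/hidden representation
    nhead=8,               # number of attention heads (must divide d_model)
    dim_feedforward=2048,  # FFN inner size, commonly set to 4 * d_model
    batch_first=True,
)

x = torch.randn(2, 16, d_model)  # (batch, sequence, d_model)
y = encoder_layer(x)

# Input and output share the same d_model-dimensional last axis.
print(x.shape, y.shape)  # torch.Size([2, 16, 512]) torch.Size([2, 16, 512])
```

Note that only the last dimension is governed by d_model; batch and sequence lengths are unconstrained, which is what makes it natural to read d_model as the network's width.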
Related
Hyperparameter Tuning Trade-offs
FFN Hidden Size in Transformers
Vocabulary Size in Transformers
Model Depth in Transformers
Number of Attention Heads
Embedding Size in Transformer Models
Evaluating Language Model Design Choices
Trade-offs in Language Model Vocabulary Design
Learn After
Impact of Hidden Size on Sub-Layer Dimensions